Tutorial (Text Data Processing)#

(Last updated: Feb 27, 2024)

This tutorial will familiarize you with the data science pipeline of processing text data. We will go through the various steps involved in the Natural Language Processing (NLP) pipeline for topic modelling and topic classification, including tokenization, lemmatization, and obtaining word embeddings. We will also build a neural network using PyTorch for multi-class topic classification using the dataset.

The AG’s News Topic Classification Dataset contains news articles from four different categories, making it a nice source of text data for NLP tasks. We will guide you through the process of understanding the dataset, implementing various NLP techniques, and building a model for classification.


Scenario#

The AG’s News Topic Classification Dataset is a collection of over 1 million news articles from more than 2000 news sources. The dataset was created by selecting the 4 largest classes from the original corpus, resulting in 120,000 training samples and 7,600 testing samples. The dataset is provided by the academic community for research purposes in data mining, information retrieval, and other non-commercial activities. We will use it to demonstrate various NLP techniques on real data and, in the end, build two models with this data. The files train.csv and test.csv contain all the training and testing samples as comma-separated values with 3 columns: class index, title, and description. Download train.csv and test.csv for the following tasks.

Import Packages#

Important

To make this notebook work, you need to install PyTorch. You can also copy this notebook (as well as the dataset) to Google Colab and run the notebook on it. You also need to install the packages in this link in your Python development environment.

We put all the packages that are needed for this tutorial below:

import os
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import spacy
import torch
import torch.nn as nn
import torch.optim as optim

from gensim.models import Word2Vec

from nltk.corpus import stopwords, wordnet
from nltk.stem import SnowballStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score, confusion_matrix

from tqdm.notebook import tqdm

from wordcloud import WordCloud

from xml.sax import saxutils as su

# Add tqdm functions to pandas.
tqdm.pandas()
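
If you are setting up a fresh environment, the sketch below downloads the NLTK resources and the spaCy model that are used later in this tutorial. The pip command in the comment is only a suggestion inferred from the imports above, and resource names can differ slightly across versions, so treat it as a starting point rather than an exact recipe.

# Suggested installation (run in a terminal); adjust to your environment:
#   pip install torch nltk spacy gensim scikit-learn pandas matplotlib seaborn wordcloud tqdm
# Download the NLTK resources used in this tutorial (tokenizer, POS tagger, stopwords, WordNet).
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("stopwords")
nltk.download("wordnet")

# Download the small English spaCy model that is loaded later in this tutorial.
spacy.cli.download("en_core_web_sm")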

Task Answers#

The code block below contains answers for the assignments in this tutorial. Do not check the answers in the next cell before practicing the tasks.

def check_answer_df(df_result, df_answer, n=1):
    """
    This function checks if two output dataframes are the same.

    Parameters
    ----------
    df_result : pandas.DataFrame
        The result from the output of a function.
    df_answer: pandas.DataFrame
        The expected output of the function.
    n : int
        The numbering of the test case.
    """
    try:
        assert df_answer.equals(df_result)
        print(f"Test case {n} passed.")
    except Exception:
        print(f"Test case {n} failed.")
        print("Your output is:")
        display(df_result)
        print("Expected output is:")
        display(df_answer)


def check_answer_np(arr_result, arr_answer, n=1):
    """
    This function checks if two output numpy arrays are the same.

    Parameters
    ----------
    arr_result : numpy.ndarray
        The result from the output of a function.
    arr_answer: numpy.ndarray
        The expected output of the function.
    n : int
        The numbering of the test case.
    """
    try:
        assert np.array_equal(arr_result, arr_answer)
        print(f"Test case {n} passed.")
    except Exception:
        print(f"Test case {n} failed.")
        print("Your output is:")
        print(arr_result)
        print("Expected output is:")
        print(arr_answer)


def answer_tokenize_and_lemmatize(df):
    """
    Tokenize and lemmatize the text in the dataset.

    Parameters
    ----------
    df : pandas.DataFrame
        The dataframe containing at least the "text" column.

    Returns
    -------
    pandas.DataFrame
        The dataframe with the added "tokens" column.
    """
    # Copy the dataframe to avoid editing the original one.
    df = df.copy(deep=True)

    # Apply the tokenizer to create the tokens column.
    df["tokens"] = df["text"].progress_apply(word_tokenize)

    # Apply the lemmatizer on every word in the tokens list.
    df["tokens"] = df["tokens"].progress_apply(
        lambda tokens: [lemmatizer.lemmatize(token, wordnet_pos(tag)) for token, tag in nltk.pos_tag(tokens)]
    )

    return df


def answer_get_word_counts(df, token_col="tokens"):
    """
    Generate dataframes with the word counts for each class in the data.

    Parameters
    ----------
    df : pandas.DataFrame
        The dataframe containing at least the "class" and "tokens" columns.
    token_col: str
        Name of the column that stores the tokens.

    Returns
    -------
    pandas.DataFrame:
        There should be three columns in this dataframe.
        The "class" column shows the document class.
        The "tokens" column means tokens in the document class.
        The "count" column means the number of appearances of each token in the class.
        The dataframe should be sorted by the "class" and "count" columns.
    """
    # Copy the dataframe to avoid editing the original one.
    df = df.copy(deep=True)

    # We need to filter out non-words.
    # Notice that the token column contains an array of tokens (not just one token).
    df[token_col] = df[token_col].apply(lambda tokens: [token.lower() for token in tokens if token.isalpha()])

    # Each item in the token column contains an array, which cannot be used directly.
    # Our goal is to count the tokens.
    # Thus, we need to explode the tokens so that every token gets its own row.
    # Then, at the later step, we can group the tokens and count them.
    df = df.explode(token_col)

    # Option 1:
    # - First, perform the groupby function on class and token.
    # - Then, get the size of how many rows per token (i.e., token counts).
    # - Finally, add the counts as a new column.
    counts = df.groupby(["class", token_col]).size().reset_index(name="count")

    # Option 2 has a similar logic but uses the pivot_table function.
    # counts = df.pivot_table(index=["class", token_col], aggfunc="size").reset_index(name="count")

    # Sort the values on the class and count.
    counts = counts.sort_values(["class", "count"], ascending=[True, False])

    return counts


def answer_remove_stopwords(df):
    """
    Remove stopwords from the tokens.

    Parameters
    ----------
    df : pandas.DataFrame
        There should be three columns in this dataframe.
        The "class" column shows the document class.
        The "tokens" column means tokens in the document class.
        The "count" column means the number of appearances of each token in the class.
        The dataframe should be sorted by the "class" and "count" columns.

    Returns
    -------
    pandas.DataFrame
        The dataframe with the stopwords rows removed.
    """
    # Copy the dataframe to avoid editing the original one.
    df = df.copy(deep=True)

    # Using a set for quicker lookups.
    stopwords_set = set(stopwords_list)

    # Filter stopwords from tokens.
    df = df[~df["tokens"].isin(stopwords_set)]

    return df


def answer_get_index_of_top_n_items(array, n=3):
    """
    Given a NumPy array, return the indexes of the top "n" number of items according to their values.

    Parameters
    ----------
    array : numpy.ndarray
        A 1D NumPy array.
    n : int
        The top "n" number of items that we want.

    Returns
    -------
    numpy.ndarray
        The indexes of the top "n" items.
    """
    # Sort the indexes from lowest to highest value, then walk backwards to take the top "n".
    return array.argsort()[:-n-1:-1]

Task 3: Preprocess Text Data#

In this task, we will preprocess the text data from the AG News Dataset. First, we need to load the files.

df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

For performance reasons, we will only use a small subset of this dataset.

df_train = df_train.groupby("Class Index").head(1000)
df_test = df_test.groupby("Class Index").head(100)
display(df_train, df_test)
Class Index Title Description
0 3 Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindli...
1 3 Carlyle Looks Toward Commercial Aerospace (Reu... Reuters - Private investment firm Carlyle Grou...
2 3 Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\ab...
3 3 Iraq Halts Oil Exports from Main Southern Pipe... Reuters - Authorities have halted oil export\f...
4 3 Oil prices soar to all-time record, posing new... AFP - Tearaway world oil prices, toppling reco...
... ... ... ...
4865 2 IOA sets up committee to probe dope scandal Athens, Aug 20. (PTI):In a belated damage-cont...
4866 2 FACTBOX-Jonathan Woodgate factbox MADRID, Aug 20 (Reuters) - Factbox on England ...
4867 2 British canoe pair lose out ATHENS (Reuters) - Slovakian twins Peter and P...
4895 2 U.S. Softball Team Posts Shutout No. 7 (AP) AP - Cat Osterman struck out 10 in six innings...
4896 2 Paul Hamm's example PAUL HAMM'S fall and rise are what make the Ol...

4000 rows × 3 columns

Class Index Title Description
0 3 Fears for T N pension after talks Unions representing workers at Turner Newall...
1 4 The Race is On: Second Private Team Sets Launc... SPACE.com - TORONTO, Canada -- A second\team o...
2 4 Ky. Company Wins Grant to Study Peptides (AP) AP - A company founded by a chemistry research...
3 4 Prediction Unit Helps Forecast Wildfires (AP) AP - It's barely dawn when Mike Fitzpatrick st...
4 4 Calif. Aims to Limit Farm-Related Smog (AP) AP - Southern California's smog-fighting agenc...
... ... ... ...
473 3 New Overtime Rules Take Effect New Bush administration rules that scale back ...
479 3 Dollar Holds Gains, Fed Comments Help TOKYO (Reuters) - The dollar held on to the p...
481 3 Dark arts of spin evident in phoney war for Abbey THE phoney war over the fate of Abbey grinds o...
482 3 Controversial US Overtime Rules Take Effect New overtime rules have taken effect in the Un...
484 3 SAS Braathens to cut Gatwick, Geneva flights blackhawk writes quot;SAS Braathens, the Norw...

400 rows × 3 columns

As you can see, all the classes are distributed evenly in the train and test data.

display(df_train["Class Index"].value_counts(), df_test["Class Index"].value_counts())
Class Index
3    1000
4    1000
2    1000
1    1000
Name: count, dtype: int64
Class Index
3    100
4    100
2    100
1    100
Name: count, dtype: int64

To make the data more understandable, we will add a class column, derived from the original Class Index column, that contains the category of each news article. To process both the title and news text together, we will combine the Title and Description columns into one text column. We will work with just the train data until the point where we need the test data again.

def reformat_data(df):
    """
    Reformat the Class Index column into a class column and combine
    the Title and Description columns into a text column.
    Select only the class_idx, class, and text columns afterwards.

    Parameters
    ----------
    df : pandas.DataFrame
        The original dataframe.

    Returns
    -------
    pandas.DataFrame
        The reformatted dataframe.
    """
    # Make the class column using a dictionary.
    df = df.rename(columns={"Class Index": "class_idx"})
    classes = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}
    df["class"] = df["class_idx"].apply(classes.get)

    # Use string concatenation for the text column and unescape HTML characters.
    df["text"] = (df["Title"] + " " + df["Description"]).apply(su.unescape)

    # Select only the class_idx, class, and text column.
    df = df[["class_idx", "class", "text"]]
    return df
df_train_reformat = reformat_data(df_train)
display(df_train_reformat)
class_idx class text
0 3 Business Wall St. Bears Claw Back Into the Black (Reute...
1 3 Business Carlyle Looks Toward Commercial Aerospace (Reu...
2 3 Business Oil and Economy Cloud Stocks' Outlook (Reuters...
3 3 Business Iraq Halts Oil Exports from Main Southern Pipe...
4 3 Business Oil prices soar to all-time record, posing new...
... ... ... ...
4865 2 Sports IOA sets up committee to probe dope scandal At...
4866 2 Sports FACTBOX-Jonathan Woodgate factbox MADRID, Aug ...
4867 2 Sports British canoe pair lose out ATHENS (Reuters) -...
4895 2 Sports U.S. Softball Team Posts Shutout No. 7 (AP) AP...
4896 2 Sports Paul Hamm's example PAUL HAMM'S fall and rise ...

4000 rows × 3 columns

Tokenization#

Tokenization is the process of breaking down a text into individual tokens, which are usually words but can also be phrases or sentences. It helps language models understand and analyze text data by breaking it down into smaller, more manageable pieces. While it may seem trivial, tokenization can be done in several different ways, which makes it a complex and challenging step that influences downstream NLP applications.

For example, in languages like English, it is generally straightforward to identify words by using spaces as delimiters. However, there are exceptions, such as contractions like “can’t” and hyphenated words like “self-driving”. In Dutch, multiple nouns can be combined into one long compound noun without any delimiter, which makes tokenization harder: how would you tokenize “hippopotomonstrosesquippedaliofobie”? In other languages, such as Chinese and Japanese, there are no spaces between words, so identifying word boundaries is much more difficult.

To illustrate the use of tokenization, let’s consider the following example, which tokenizes a sample text using the word_tokenize function from the NLTK package. That function uses a pre-trained tokenization model for English.

# Sample text.
text = "The quick brown fox jumped over the lazy dog. The cats couldn't wait to sleep all day."

# Tokenize the text.
tokens = word_tokenize(text)

# Print the text and the tokens.
print("Original text:", text)
print("Tokenized text:", tokens)
Original text: The quick brown fox jumped over the lazy dog. The cats couldn't wait to sleep all day.
Tokenized text: ['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', '.', 'The', 'cats', 'could', "n't", 'wait', 'to', 'sleep', 'all', 'day', '.']

Part-of-speech tagging#

Part-of-speech (POS) tagging is the process of assigning each word in a text corpus with a specific part-of-speech tag based on its context and definition. The tags typically include nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, interjections, and more. POS tagging can help other NLP tasks disambiguate a token somewhat due to the added context.

pos_tags = nltk.pos_tag(tokens)
print(pos_tags)
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumped', 'VBD'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.'), ('The', 'DT'), ('cats', 'NNS'), ('could', 'MD'), ("n't", 'RB'), ('wait', 'VB'), ('to', 'TO'), ('sleep', 'VB'), ('all', 'DT'), ('day', 'NN'), ('.', '.')]

Stemming / Lemmatization#

Stemming and lemmatization are two common techniques used in NLP to preprocess and normalize text data. Both techniques involve transforming words into their root form, but they differ in their approach and the level of normalization they provide.

Stemming is a technique that involves reducing words to their base or stem form by removing any affixes or suffixes. For example, the stem of the word “lazily” would be “lazi”. Stemming is a simple and fast technique that can be useful. However, it can also produce inaccurate or incorrect results since it does not consider the context or part of speech of the word.

Lemmatization, on the other hand, is a more sophisticated technique that involves identifying the base or dictionary form of a word, also known as the lemma. Unlike stemming, lemmatization can consider the part of speech of the word, which can make it more accurate and reliable. With lemmatization, the lemma of the word “lazily” would be “lazy”. Lemmatization can be slower and more complex than stemming but provides a higher level of normalization.

# Initialize the stemmer and lemmatizer.
stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()


def wordnet_pos(nltk_pos):
    """
    Function to map POS tags to wordnet tags for lemmatizer.
    """
    if nltk_pos.startswith("V"):
        return wordnet.VERB
    elif nltk_pos.startswith("J"):
        return wordnet.ADJ
    elif nltk_pos.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN


# Perform stemming and lemmatization separately on the tokens.
stemmed_tokens = [stemmer.stem(token) for token in tokens]
lemmatized_tokens = [lemmatizer.lemmatize(token, wordnet_pos(tag))
                     for token, tag in nltk.pos_tag(tokens)]

# Print the results.
print("Stemmed text:", stemmed_tokens)
print("Lemmatized text:", lemmatized_tokens)
Stemmed text: ['the', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazi', 'dog', '.', 'the', 'cat', 'could', "n't", 'wait', 'to', 'sleep', 'all', 'day', '.']
Lemmatized text: ['The', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog', '.', 'The', 'cat', 'could', "n't", 'wait', 'to', 'sleep', 'all', 'day', '.']

Stopword Removal#

Stopword removal is a common technique used in NLP to preprocess and clean text data by removing words that are considered to be of little or no value in terms of conveying meaning or information. These words are called “stopwords” and they include common words such as “the”, “a”, “an”, “and”, “or”, “but”, and so on.

The purpose of stopword removal in NLP is to improve the accuracy and efficiency of text analysis and processing by reducing the noise and complexity of the data. Stopwords are often used to form grammatical structures in a sentence, but they do not carry much meaning or relevance to the main topic or theme of the text. So by removing these words, we can reduce the dimensionality of the text data, improve the performance of machine learning models, and speed up the processing of text data. NLTK has a predefined list of stopwords for English.

# English stopwords in NLTK.
stopwords_list = stopwords.words('english')
print(stopwords_list)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
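
As a quick illustration (separate from the assignments below), the following minimal sketch filters the stopwords out of the sample tokens that we created earlier; note that we lowercase each token before checking it against the list.

# Remove stopwords from the earlier sample tokens (lowercase before comparing).
stopwords_set = set(stopwords_list)
tokens_without_stopwords = [token for token in tokens if token.lower() not in stopwords_set]
print(tokens_without_stopwords)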

Assignment for Task 3.1: Tokenization and Lemmatization#

The first step is to tokenize and lemmatize the sentences. Your task (which is your assignment) is to write functions to do the following:

  • Since we want to use our text to make a model later on, we need to preprocess it. Add a tokens column to the df_train dataframe with the text tokenized, then lemmatize those tokens. You must use the POS tags when lemmatizing.

    • Hint: Use the pandas.Series.apply function with the imported nltk.tokenize.word_tokenize function. Recall that you can use the pd.Series.apply? syntax in a code cell for more information.

    • Hint: use the nltk.stem.WordNetLemmatizer.lemmatize function to lemmatize a token. Use the wordnet_pos function to obtain the POS tag for the lemmatizer.

  • Tokenizing and lemmatizing the entire dataset can take a while too. Use tqdm and the pandas.Series.progress_apply to show progress bars for the operations.

def tokenize_and_lemmatize(df):
    """
    Tokenize and lemmatize the text in the dataset.

    Parameters
    ----------
    df : pandas.DataFrame
        The dataframe containing at least the "text" column.

    Returns
    -------
    pandas.DataFrame
        The dataframe with the added "tokens" column.
    """
    ###################################
    # Fill in your answer here
    return None
    ###################################

Our goal is to have a dataframe that looks like the following. For simplicity, we only show the top 5 most frequent words. Your data frame should have more rows.

# This part of code will take some time to run.
answer_df_with_tokens = answer_tokenize_and_lemmatize(df_train_reformat)
answer_df_with_tokens.groupby("class").head(n=5)
class_idx class text tokens
0 3 Business Wall St. Bears Claw Back Into the Black (Reute... [Wall, St., Bears, Claw, Back, Into, the, Blac...
1 3 Business Carlyle Looks Toward Commercial Aerospace (Reu... [Carlyle, Looks, Toward, Commercial, Aerospace...
2 3 Business Oil and Economy Cloud Stocks' Outlook (Reuters... [Oil, and, Economy, Cloud, Stocks, ', Outlook,...
3 3 Business Iraq Halts Oil Exports from Main Southern Pipe... [Iraq, Halts, Oil, Exports, from, Main, Southe...
4 3 Business Oil prices soar to all-time record, posing new... [Oil, price, soar, to, all-time, record, ,, po...
78 4 Sci/Tech 'Madden,' 'ESPN' Football Score in Different W... ['Madden, ,, ', 'ESPN, ', Football, Score, in,...
79 4 Sci/Tech Group to Propose New High-Speed Wireless Forma... [Group, to, Propose, New, High-Speed, Wireless...
80 4 Sci/Tech AOL to Sell Cheap PCs to Minorities and Senior... [AOL, to, Sell, Cheap, PCs, to, Minorities, an...
81 4 Sci/Tech Companies Approve New High-Capacity Disc Forma... [Companies, Approve, New, High-Capacity, Disc,...
82 4 Sci/Tech Missing June Deals Slow to Return for Software... [Missing, June, Deals, Slow, to, Return, for, ...
448 2 Sports Phelps, Thorpe Advance in 200 Freestyle (AP) A... [Phelps, ,, Thorpe, Advance, in, 200, Freestyl...
449 2 Sports Reds Knock Padres Out of Wild-Card Lead (AP) A... [Reds, Knock, Padres, Out, of, Wild-Card, Lead...
450 2 Sports Dreaming done, NBA stars awaken to harsh Olymp... [Dreaming, do, ,, NBA, star, awaken, to, harsh...
451 2 Sports Indians Beat Twins 7-1, Nearing AL Lead (AP) A... [Indians, Beat, Twins, 7-1, ,, Nearing, AL, Le...
452 2 Sports Galaxy, Crew Play to 0-0 Tie (AP) AP - Kevin H... [Galaxy, ,, Crew, Play, to, 0-0, Tie, (, AP, )...
492 1 World Venezuelans Vote Early in Referendum on Chavez... [Venezuelans, Vote, Early, in, Referendum, on,...
493 1 World S.Koreans Clash with Police on Iraq Troop Disp... [S.Koreans, Clash, with, Police, on, Iraq, Tro...
494 1 World Palestinians in Israeli Jails Start Hunger Str... [Palestinians, in, Israeli, Jails, Start, Hung...
495 1 World Seven Georgian soldiers wounded as South Osset... [Seven, Georgian, soldier, wound, a, South, Os...
496 1 World Rwandan Troops Arrive in Darfur (AP) AP - Doze... [Rwandan, Troops, Arrive, in, Darfur, (, AP, )...

The code below tests if the output of your function matches the expected output.

df_with_tokens = tokenize_and_lemmatize(df_train_reformat)
check_answer_df(df_with_tokens, answer_df_with_tokens)
Test case 1 failed.
Your output is:
None
Expected output is:
class_idx class text tokens
0 3 Business Wall St. Bears Claw Back Into the Black (Reute... [Wall, St., Bears, Claw, Back, Into, the, Blac...
1 3 Business Carlyle Looks Toward Commercial Aerospace (Reu... [Carlyle, Looks, Toward, Commercial, Aerospace...
2 3 Business Oil and Economy Cloud Stocks' Outlook (Reuters... [Oil, and, Economy, Cloud, Stocks, ', Outlook,...
3 3 Business Iraq Halts Oil Exports from Main Southern Pipe... [Iraq, Halts, Oil, Exports, from, Main, Southe...
4 3 Business Oil prices soar to all-time record, posing new... [Oil, price, soar, to, all-time, record, ,, po...
... ... ... ... ...
4865 2 Sports IOA sets up committee to probe dope scandal At... [IOA, set, up, committee, to, probe, dope, sca...
4866 2 Sports FACTBOX-Jonathan Woodgate factbox MADRID, Aug ... [FACTBOX-Jonathan, Woodgate, factbox, MADRID, ...
4867 2 Sports British canoe pair lose out ATHENS (Reuters) -... [British, canoe, pair, lose, out, ATHENS, (, R...
4895 2 Sports U.S. Softball Team Posts Shutout No. 7 (AP) AP... [U.S., Softball, Team, Posts, Shutout, No, ., ...
4896 2 Sports Paul Hamm's example PAUL HAMM'S fall and rise ... [Paul, Hamm, 's, example, PAUL, HAMM, 'S, fall...

4000 rows × 4 columns

Assignment for Task 3.2: Word Counts#

To see what the most used words per class are, create a new, separate dataframe with token counts.

  • Hint: use the pandas.Series.apply and str.isalpha() functions to filter out non-alphabetical tokens.

  • Hint: use the pandas.DataFrame.explode to create one row per class and token (a toy sketch of explode and groupby follows after these hints).

  • Hint: use pandas.DataFrame.groupby with .size() afterwards or pandas.DataFrame.pivot_table with size as the aggfunc to obtain the occurrences per class.

  • Hint: use the pandas.Series.reset_index function to obtain a dataframe with [class, tokens, count] as the columns.

  • Hint: use the pandas.DataFrame.sort_values function for sorting a dataframe.
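
If the hints above feel abstract, here is a small toy sketch (using made-up data, not the assignment dataframe) of how explode and groupby combine to produce per-class token counts:

# Toy example: count how often each token appears per class.
toy = pd.DataFrame({
    "class": ["A", "A", "B"],
    "tokens": [["cat", "dog", "cat"], ["dog"], ["fish"]],
})

# explode: one row per (class, token) pair.
toy_exploded = toy.explode("tokens")

# groupby + size: count the rows per (class, token) combination.
toy_counts = toy_exploded.groupby(["class", "tokens"]).size().reset_index(name="count")
print(toy_counts)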

def get_word_counts(df, token_col="tokens"):
    """
    Generate dataframes with the word counts for each class in the data.

    Parameters
    ----------
    df : pandas.DataFrame
        The dataframe containing at least the "class" and "tokens" columns.
    token_col: str
        Name of the column that stores the tokens.

    Returns
    -------
    pandas.DataFrame:
        There should be three columns in this dataframe.
        The "class" column shows the document class.
        The "tokens" column means tokens in the document class.
        The "count" column means the number of appearances of each token in the class.
        The dataframe should be sorted by the "class" and "count" columns.
    """
    ###################################
    # Fill in your answer here
    return None
    ###################################

Our goal is to have a dataframe that looks like the following. For simplicity, we only show the top 5 most frequent words per class. Your data frame should have more rows.

answer_word_counts = answer_get_word_counts(answer_df_with_tokens, token_col="tokens")
answer_word_counts.groupby("class").head(n=5)
class tokens count
3540 Business the 1460
0 Business a 1217
3586 Business to 923
1743 Business in 776
2423 Business on 687
8929 Sci/Tech the 1660
3960 Sci/Tech a 1121
8997 Sci/Tech to 1074
7319 Sci/Tech of 939
4422 Sci/Tech be 744
13816 Sports the 2280
9565 Sports a 978
11631 Sports in 815
13888 Sports to 793
12471 Sports of 714
18749 World the 1653
14384 World a 1241
18808 World to 1172
16474 World in 1164
17302 World of 908

The code below tests if the output of your function matches the expected output.

word_counts = get_word_counts(df_with_tokens, token_col="tokens")
check_answer_df(word_counts, answer_word_counts)
Test case 1 failed.
Your output is:
None
Expected output is:
class tokens count
3540 Business the 1460
0 Business a 1217
3586 Business to 923
1743 Business in 776
2423 Business on 687
... ... ... ...
19259 World zalmay 1
19261 World zeitoun 1
19262 World zesn 1
19263 World zim 1
19265 World zimbabwean 1

19268 rows × 3 columns

In the following function, we use the wordcloud package to visualize the word counts that you just computed.

def visualize_word_counts(df, class_col="class", token_col="tokens", count_col="count"):
    """
    Displays word clouds given a word counts dataframe.

    Parameters
    ----------
    df : pandas.DataFrame
        The dataframe with three columns:
        - The "class" column (each document class)
        - The "tokens" column (showing tokens in each document class)
        - The "count" column (showing counts for each token)
    class_col : str
        Name of the class column (if different from "class").
    token_col : str
        Name of the token column (if different from "tokens").
    count_col : str
        Name of the count column (if different from "count").
    """
    # Groupby the class column and loop through all of them
    for name, df_group in df.groupby(class_col):
        # Compute a dictionary with word frequencies
        frequencies = dict(zip(df_group[token_col], df_group[count_col]))
        # Generate word cloud from frequencies
        wordcloud = WordCloud(background_color="white", width=1000, height=500, random_state=42).generate_from_frequencies(frequencies)
        # Display image
        plt.axis("off")
        plt.title("Class: " + name)
        plt.imshow(wordcloud)
        plt.show()
visualize_word_counts(answer_word_counts)
(Output: four word cloud images, one per news class.)

Assignment for Task 3.3: Stop Words Removal#

The stop words make it difficult for us to identify representative words for each class. Let’s display the word counts using the data without stop words. But we need to remove the stop words first. Your task (which is your assignment) is to write functions to do the following:

  • Remove the stopwords from the tokens column in the dataframe.

    • Hint: use the pandas.DataFrame.isin function.

    • Hint: use the stopwords_list variable to help you check if a token is a stop word.

def remove_stopwords(df):
    """
    Remove stopwords from the tokens.

    Parameters
    ----------
    df : pandas.DataFrame
        There should be three columns in this dataframe.
        The "class" column shows the document class.
        The "tokens" column means tokens in the document class.
        The "count" column means the number of appearances of each token in the class.
        The dataframe should be sorted by the "class" and "count" columns.

    Returns
    -------
    pandas.DataFrame
        The dataframe with the stopwords rows removed.
    """
    ###################################
    # Fill in your answer here
    return None
    ###################################

Our goal is to have a dataframe that looks like the following. For simplicity, we only show the top 5 most frequent words per class. Your data frame should have more rows.

answer_word_counts_no_stopword = answer_remove_stopwords(answer_word_counts)
answer_word_counts_no_stopword.groupby("class").head(n=5)
class tokens count
2947 Business reuters 483
2351 Business new 297
2697 Business price 281
3040 Business say 281
2418 Business oil 278
4208 Sci/Tech ap 236
7228 Sci/Tech new 224
8237 Sci/Tech say 186
4951 Sci/Tech company 137
7031 Sci/Tech microsoft 121
9817 Sports athens 354
9747 Sports ap 315
12497 Sports olympic 244
14281 Sports win 223
11335 Sports gold 208
18159 World say 305
18012 World reuters 276
14598 World ap 261
17194 World najaf 154
14474 World afp 149

The code below tests if the output of your function matches the expected output.

word_counts_no_stopword = remove_stopwords(word_counts)
check_answer_df(word_counts_no_stopword, answer_word_counts_no_stopword)
Test case 1 failed.
Your output is:
None
Expected output is:
class tokens count
2947 Business reuters 483
2351 Business new 297
2697 Business price 281
3040 Business say 281
2418 Business oil 278
... ... ... ...
19259 World zalmay 1
19261 World zeitoun 1
19262 World zesn 1
19263 World zim 1
19265 World zimbabwean 1

18787 rows × 3 columns

visualize_word_counts(answer_word_counts_no_stopword)
(Output: four word cloud images, one per news class, now without stop words.)

Another Option: spaCy#

spaCy is another library used to perform various NLP tasks like tokenization, part-of-speech tagging, named entity recognition, dependency parsing, and much more. It provides pre-trained models for different languages and domains, which can be used as-is but also can be fine-tuned on a specific task or domain.

In an object-oriented way, spaCy can be thought of as a collection of classes and objects that work together to perform NLP tasks. Some of the important functions and classes in spaCy include:

  • nlp: The core function that provides the main functionality of spaCy. It is used to process text and create a Doc object.

  • Doc: A container for accessing linguistic annotations like tokens, part-of-speech tags, named entities, and dependency parse information. It is created by the nlp function and represents a processed document.

  • Token: An object representing a single token in a Doc object. It contains information like the token text, part-of-speech tag, lemma, embedding, and much more.

When a text is processed by spaCy, it is first passed to the nlp function, which uses the loaded model to tokenize the text and applies various linguistic annotations like part-of-speech tagging, named entity recognition, and dependency parsing in the background. The resulting annotations are stored in a Doc object, which can be accessed and manipulated using various methods and attributes.

# Load the small English model in spaCy.
# Disable Named Entity Recognition and the parser in the model pipeline since we're not using them.
# Check the following website for the spaCy NLP pipeline:
# - https://spacy.io/usage/processing-pipelines
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

# Process the text using spaCy.
doc = nlp(text)

# This becomes a spaCy Doc object, which prints nicely as the original string.
print(doc)
The quick brown fox jumped over the lazy dog. The cats couldn't wait to sleep all day.

The Doc object can be iterated over to access each Token object in the document. We can also directly access multiple attributes of the Token objects. For example, we can directly access the lemma of the token with Token.lemma_ and check if a token is a stop word with Token.is_stop. To make it easy to see them, we put them in a data frame.

spacy_doc_attributes = [(token, token.lemma_, token.is_stop) for token in doc]
pd.DataFrame(data=spacy_doc_attributes, columns=["token", "lemma", "is_stopword"])
token lemma is_stopword
0 The the True
1 quick quick False
2 brown brown False
3 fox fox False
4 jumped jump False
5 over over True
6 the the True
7 lazy lazy False
8 dog dog False
9 . . False
10 The the True
11 cats cat False
12 could could True
13 n't not True
14 wait wait False
15 to to True
16 sleep sleep False
17 all all True
18 day day False
19 . . False

The above example only deals with one sentence. Now we need to deal with all the sentences in all the classes. Below is a function to add a column with a Doc representation of the text column to the dataframe.

def add_spacy_doc(df):
    """
    Add a column with the spaCy Doc objects.

    Parameters
    ----------
    df : pandas.DataFrame
        The dataframe containing at least the "text" column.

    Returns
    -------
    pandas.DataFrame
        The dataframe with the added "doc" column.
    """
    # Copy the dataframe to avoid editing the original one.
    df = df.copy(deep=True)

    # Use most of the CPU cores, but leave two free (with a minimum of one process).
    n_process = max(1, os.cpu_count()-2)

    # Use multiple CPUs to speed up computing.
    df["doc"] = [doc for doc in tqdm(nlp.pipe(df["text"], n_process=n_process), total=df.shape[0])]

    return df

Now we can add the spaCy tokens using the above function. This step will take some time since it needs to process all the sentences. So we added a progress bar.

df_with_nltk_tokens_and_spacy_doc = add_spacy_doc(answer_df_with_tokens)
display(df_with_nltk_tokens_and_spacy_doc)
class_idx class text tokens doc
0 3 Business Wall St. Bears Claw Back Into the Black (Reute... [Wall, St., Bears, Claw, Back, Into, the, Blac... (Wall, St., Bears, Claw, Back, Into, the, Blac...
1 3 Business Carlyle Looks Toward Commercial Aerospace (Reu... [Carlyle, Looks, Toward, Commercial, Aerospace... (Carlyle, Looks, Toward, Commercial, Aerospace...
2 3 Business Oil and Economy Cloud Stocks' Outlook (Reuters... [Oil, and, Economy, Cloud, Stocks, ', Outlook,... (Oil, and, Economy, Cloud, Stocks, ', Outlook,...
3 3 Business Iraq Halts Oil Exports from Main Southern Pipe... [Iraq, Halts, Oil, Exports, from, Main, Southe... (Iraq, Halts, Oil, Exports, from, Main, Southe...
4 3 Business Oil prices soar to all-time record, posing new... [Oil, price, soar, to, all-time, record, ,, po... (Oil, prices, soar, to, all, -, time, record, ...
... ... ... ... ... ...
4865 2 Sports IOA sets up committee to probe dope scandal At... [IOA, set, up, committee, to, probe, dope, sca... (IOA, sets, up, committee, to, probe, dope, sc...
4866 2 Sports FACTBOX-Jonathan Woodgate factbox MADRID, Aug ... [FACTBOX-Jonathan, Woodgate, factbox, MADRID, ... (FACTBOX, -, Jonathan, Woodgate, factbox, MADR...
4867 2 Sports British canoe pair lose out ATHENS (Reuters) -... [British, canoe, pair, lose, out, ATHENS, (, R... (British, canoe, pair, lose, out, ATHENS, (, R...
4895 2 Sports U.S. Softball Team Posts Shutout No. 7 (AP) AP... [U.S., Softball, Team, Posts, Shutout, No, ., ... (U.S., Softball, Team, Posts, Shutout, No, ., ...
4896 2 Sports Paul Hamm's example PAUL HAMM'S fall and rise ... [Paul, Hamm, 's, example, PAUL, HAMM, 'S, fall... (Paul, Hamm, 's, example, PAUL, HAMM, 'S, fall...

4000 rows × 5 columns

The following function will add the spaCy tokens to our original dataframe.

def add_spacy_tokens(df):
    """
    Add a column with a list of lemmatized tokens, without stopwords and numbers.

    Parameters
    ----------
    df : pandas.DataFrame
        The dataframe containing at least the "doc" column.

    Returns
    -------
    pandas.DataFrame
        The dataframe with the "spacy_tokens" column.
    """
    # Copy the dataframe to avoid editing the original one.
    df = df.copy(deep=True)

    df["spacy_tokens"] = df["doc"].apply(
        lambda tokens: [token.lemma_ for token in tokens if token.is_alpha and not token.is_stop]
    )

    return df

We can run the code below to add the spaCy tokens.

df_with_nltk_tokens_and_spacy_tokens = add_spacy_tokens(df_with_nltk_tokens_and_spacy_doc)
display(df_with_nltk_tokens_and_spacy_tokens)
class_idx class text tokens doc spacy_tokens
0 3 Business Wall St. Bears Claw Back Into the Black (Reute... [Wall, St., Bears, Claw, Back, Into, the, Blac... (Wall, St., Bears, Claw, Back, Into, the, Blac... [Wall, Bears, Claw, Black, Reuters, Reuters, S...
1 3 Business Carlyle Looks Toward Commercial Aerospace (Reu... [Carlyle, Looks, Toward, Commercial, Aerospace... (Carlyle, Looks, Toward, Commercial, Aerospace... [Carlyle, look, Commercial, Aerospace, Reuters...
2 3 Business Oil and Economy Cloud Stocks' Outlook (Reuters... [Oil, and, Economy, Cloud, Stocks, ', Outlook,... (Oil, and, Economy, Cloud, Stocks, ', Outlook,... [oil, Economy, Cloud, Stocks, Outlook, Reuters...
3 3 Business Iraq Halts Oil Exports from Main Southern Pipe... [Iraq, Halts, Oil, Exports, from, Main, Southe... (Iraq, Halts, Oil, Exports, from, Main, Southe... [Iraq, Halts, Oil, Exports, Main, Southern, Pi...
4 3 Business Oil prices soar to all-time record, posing new... [Oil, price, soar, to, all-time, record, ,, po... (Oil, prices, soar, to, all, -, time, record, ... [oil, price, soar, time, record, pose, new, me...
... ... ... ... ... ... ...
4865 2 Sports IOA sets up committee to probe dope scandal At... [IOA, set, up, committee, to, probe, dope, sca... (IOA, sets, up, committee, to, probe, dope, sc... [IOA, set, committee, probe, dope, scandal, At...
4866 2 Sports FACTBOX-Jonathan Woodgate factbox MADRID, Aug ... [FACTBOX-Jonathan, Woodgate, factbox, MADRID, ... (FACTBOX, -, Jonathan, Woodgate, factbox, MADR... [FACTBOX, Jonathan, Woodgate, factbox, MADRID,...
4867 2 Sports British canoe pair lose out ATHENS (Reuters) -... [British, canoe, pair, lose, out, ATHENS, (, R... (British, canoe, pair, lose, out, ATHENS, (, R... [british, canoe, pair, lose, ATHENS, Reuters, ...
4895 2 Sports U.S. Softball Team Posts Shutout No. 7 (AP) AP... [U.S., Softball, Team, Posts, Shutout, No, ., ... (U.S., Softball, Team, Posts, Shutout, No, ., ... [Softball, Team, Posts, Shutout, AP, AP, Cat, ...
4896 2 Sports Paul Hamm's example PAUL HAMM'S fall and rise ... [Paul, Hamm, 's, example, PAUL, HAMM, 'S, fall... (Paul, Hamm, 's, example, PAUL, HAMM, 'S, fall... [Paul, Hamm, example, PAUL, HAMM, fall, rise, ...

4000 rows × 6 columns

Now we can use the function that we wrote before to get the word counts from the spaCy tokens.

spacy_word_counts = answer_get_word_counts(df_with_nltk_tokens_and_spacy_tokens, token_col="spacy_tokens")
spacy_word_counts.groupby("class").head(n=5)
class spacy_tokens count
2792 Business reuters 483
2559 Business price 308
2231 Business new 298
2881 Business say 287
2290 Business oil 281
3933 Sci/Tech ap 236
6784 Sci/Tech new 225
7740 Sci/Tech say 161
4630 Sci/Tech company 140
6593 Sci/Tech microsoft 124
9177 Sports athens 355
9109 Sports ap 315
11771 Sports olympic 261
13437 Sports win 252
10644 Sports gold 212
17099 World say 294
16956 World reuters 276
13717 World ap 261
16176 World najaf 157
13610 World afp 150

Task 4: Unsupervised Learning - Topic Modeling#

Topic modelling is a technique used in NLP that aims to identify the underlying topics or themes in a collection of texts. One way to perform topic modelling is using the probabilistic model Latent Dirichlet Allocation (LDA).

LDA assumes that each document in a collection is a mixture of different topics, and each topic is a probability distribution over a set of words. The model then infers the underlying topic distribution for each document in the collection and the word distribution for each topic. LDA is trained using an iterative algorithm that maximizes the likelihood of observing the given documents.

To use LDA, we need to represent the documents as a bag of words, where the order of the words is ignored and only the frequency of each word in the document is considered. This bag-of-words representation allows us to represent each document as a vector of word frequencies, which can be used as input to the LDA algorithm. Computing LDA might take a moment given our dataset size.

# Convert preprocessed text to bag-of-words representation using CountVectorizer.
vectorizer = CountVectorizer(max_features=1000)

We will use the fit_transform function of the vectorizer, which expects one string per document as input. So, we join the tokens of each document back into a single string. We also reset the index for consistency.

df_strings = df_with_nltk_tokens_and_spacy_tokens["spacy_tokens"].apply(lambda x: " ".join(x))
df_strings = df_strings.reset_index(drop=True)
df_strings
0       Wall Bears Claw Black Reuters Reuters Short se...
1       Carlyle look Commercial Aerospace Reuters Reut...
2       oil Economy Cloud Stocks Outlook Reuters Reute...
3       Iraq Halts Oil Exports Main Southern Pipeline ...
4       oil price soar time record pose new menace eco...
                              ...                        
3995    IOA set committee probe dope scandal Athens Au...
3996    FACTBOX Jonathan Woodgate factbox MADRID Aug R...
3997    british canoe pair lose ATHENS Reuters Slovaki...
3998    Softball Team Posts Shutout AP AP Cat Osterman...
3999    Paul Hamm example PAUL HAMM fall rise Olympics...
Name: spacy_tokens, Length: 4000, dtype: object

Then, we can use the fit_transform function to get the bag of words vector.

X = vectorizer.fit_transform(df_strings.values)

We convert the original matrix to a data frame to make it easier to see the bag of words. The columns indicate tokens, and the values for each cell indicate the word counts. The number of columns in the data frame matches the max_features parameter in the CountVectorizer. The number of rows matches the size of the training data.

pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
accept accord accounting accuse action activity add advance afghanistan afp ... wrap xinhuanet xp yahoo yankees year yesterday york young yukos
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 2 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3995 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3996 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3997 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3998 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3999 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

4000 rows × 1000 columns

Now we have the bag of words vector. We can use the vector for LDA topic modeling.

# Define the number of topics to model with LDA.
num_topics = 4

# Fit LDA to the feature matrix. Verbose so we know what iteration we are on.
# The random state is just for producing consistent results.
lda = LatentDirichletAllocation(n_components=num_topics, max_iter=10, random_state=42, verbose=True)
f = lda.fit(X)
iteration: 1 of max_iter: 10
iteration: 2 of max_iter: 10
iteration: 3 of max_iter: 10
iteration: 4 of max_iter: 10
iteration: 5 of max_iter: 10
iteration: 6 of max_iter: 10
iteration: 7 of max_iter: 10
iteration: 8 of max_iter: 10
iteration: 9 of max_iter: 10
iteration: 10 of max_iter: 10

Now we can check the topic vectors in the LDA model. Each vector represents a topic in a high-dimensional space, whose dimensions correspond to the word tokens. So, the vectors can also be viewed as weights that represent how important each word token is for the topic. In the following code block, we print the shape of the vectors. The row size should match the number of topics that we set before. The column size should match the max_features parameter, which is the number of words.

lda.components_.shape
(4, 1000)
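
As mentioned above, LDA also assumes that each document is a mixture of topics. As a side illustration (a small sketch added here, not part of the original assignment), lda.transform returns that per-document topic distribution; each row should sum to roughly 1.

# Per-document topic distributions: one row per document, one column per topic.
doc_topic = lda.transform(X)
print(doc_topic.shape)  # Expected: (4000, 4) for our training subset and 4 topics.
print(doc_topic[0].round(3))  # Topic mixture of the first document.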

Assignment for Task 4#

We want to get the weights for each word in each topic and visualize them using word clouds. In the above case, the shape should be (4, 1000), which means we have 4 topics, and each topic is represented by a distribution (i.e., weights) over 1000 words. To make the word cloud visualization simple, we only want to use the top n words with the highest weights.

Your task (which is your assignment) is to write functions to do the following:

  • Given a 1D NumPy array, return the indexes of the top n number of items according to their values. In other words, we want the indexes that can help us select the highest n values. For example, for n=3 in array [3,1,2,4,0], the function should return [3,0,2], because the highest value is 4 with index 3 in the original array, and so on.

    • Hint: use the numpy.argsort function.

  • Notice that the numpy.argsort function gives you the indexes sorted from the lowest value to the highest, which is not what we want. You need to figure out a way to reverse a NumPy array and select the top n items (a small, generic illustration follows below).
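
The following small, generic illustration (on data unrelated to the assignment) shows what numpy.argsort returns and how slicing can reverse a NumPy array:

# argsort returns the indexes that would sort the array from lowest to highest value.
arr = np.array([30, 10, 20])
print(arr.argsort())  # [1 2 0]

# Slicing with a negative step reverses a NumPy array.
print(arr[::-1])  # [20 10 30]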

def get_index_of_top_n_items(array, n):
    """
    Given a NumPy array, return the indexes of the top "n" number of items according to their values.

    Parameters
    ----------
    array : numpy.ndarray
        A 1D NumPy array.
    n : int
        The top "n" number of items that we want.

    Returns
    -------
    numpy.ndarray
        The indexes of the top "n" items.
    """
    ###################################
    # Fill in your answer here
    return None
    ###################################

The following code shows the example that we mentioned above.

A = np.array([3,1,2,4,0])
answer_top_n_for_A = answer_get_index_of_top_n_items(A, n=3)
answer_top_n_for_A
array([3, 0, 2])

The code below tests if the output of your function matches the expected output.

B = lda.components_[0]
top_n_for_topic_0 = get_index_of_top_n_items(B, n=10)
answer_top_n_for_topic_0 = answer_get_index_of_top_n_items(B, n=10)
check_answer_np(top_n_for_topic_0, answer_top_n_for_topic_0)
Test case 1 failed.
Your output is:
None
Expected output is:
array([ 40, 771, 587, 995,   9, 668, 746, 482, 600, 559])

We can now use the function that we just implemented in the following function to help us get the weights for the top n words for each topic.

def get_word_weights_for_topics(lda_model, vectorizer, n=100):
    """
    Get weights for words for each topic.

    Parameters
    ----------
    lda_model : sklearn.decomposition.LatentDirichletAllocation
        The LDA model.
    vectorizer : sklearn.feature_extraction.text.CountVectorizer
        The count vectorizer.
    n : int
        Number of important words that we want to get.

    Returns
    -------
    dict of pandas.DataFrame
        A dictionary with data frames.
    """
    words = vectorizer.get_feature_names_out()
    n = len(words) if n is None else n
    topic_word_weights = {}

    for idx, topic_vector in enumerate(lda_model.components_):
        top_features_ind = answer_get_index_of_top_n_items(topic_vector, n=n)
        top_features = [words[i] for i in top_features_ind]
        weights = topic_vector[top_features_ind]
        df = pd.DataFrame(weights, index=top_features, columns=["weight"])
        df = df.sort_values(by="weight", ascending=False)
        topic_word_weights[idx] = df

    return topic_word_weights

Now we can take a look at the data first. For simplicity, we only print the first 10 important words for each topic.

topic_word_weights = get_word_weights_for_topics(lda, vectorizer, n=100)
for k, v in topic_word_weights.items():
    print(f"\nTopic #{k}:")
    print(" ".join(v.index[0:10]))
    display(v.iloc[0:10])
Topic #0:
ap say new year afp president reuters kill official monday
weight
ap 777.900339
say 275.078724
new 258.558777
year 182.610725
afp 154.539267
president 152.645783
reuters 129.226361
kill 128.812359
official 126.576249
monday 124.602403
Topic #1:
athens olympic win gold reuters olympics medal team man greece
weight
athens 402.248217
olympic 308.247659
win 264.590555
gold 243.248904
reuters 206.480045
olympics 176.248240
medal 173.249521
team 165.746113
man 156.072367
greece 141.247460
Topic #2:
reuters oil price new high say rise profit stock tuesday
weight
reuters 462.338250
oil 335.243662
price 306.689610
new 281.894969
high 272.691721
say 203.454902
rise 194.643738
profit 185.251643
stock 174.271560
tuesday 155.282308
Topic #3:
google company say reuters new public najaf share ipo security
weight
google 348.246768
company 294.646492
say 265.408176
reuters 249.955344
new 181.034259
public 165.080293
najaf 157.247419
share 145.815936
ipo 141.248697
security 139.963240

Then, we can use the word weights to create word clouds.

# Generate a word cloud for each topic.
for topic_idx, words in topic_word_weights.items():
    frequencies = dict(zip(words.index, words["weight"]))
    wordcloud = WordCloud(background_color="white", width=1000, height=500).generate_from_frequencies(frequencies)
    # Display image
    plt.axis("off")
    plt.title(f"Topic {topic_idx}")
    plt.imshow(wordcloud)
    plt.show()
(Output: four word cloud images, one per LDA topic.)

Compare this with the word cloud visualizations in the preprocessing step earlier. Does the LDA topic modeling represent the actual four document classes in the training data? What do you think?

For this task, we mainly use a qualitative way to evaluate topic modeling by visually inspecting the word clouds. There are also quantitative ways to evaluate topic models, such as topic coherence metrics, but they are not covered in this course.

Task 5: Supervised Learning - Topic Classification#

Topic classification is a task in NLP that involves automatically assigning a given text document to one or more predefined categories or topics. This task is essential for various applications, such as document organization, search engines, sentiment analysis, and more.

In recent years, deep learning models have shown remarkable performance in various NLP tasks, including topic classification. We will explore a neural network-based approach for topic classification using the PyTorch framework. PyTorch provides an efficient way to build and train neural networks with a high degree of flexibility and ease of use.

Compute Word Embeddings#

We will first look at word embeddings, which represent words as vectors in a high-dimensional space. The key idea behind word embeddings is that words with similar meanings tend to appear in similar contexts, and therefore their vector representations should be close together in this high-dimensional space. Word embeddings have been widely used in various NLP tasks such as sentiment analysis, machine translation, and information retrieval.

There are several techniques to generate word embeddings, but one of the most popular methods is the Word2Vec algorithm, which is based on a neural network architecture. Word2Vec learns embeddings by predicting the probability of a word given its context (continuous bag of words or skip-gram model). The output of the network is a set of word vectors that can be used as embeddings.

We can train a Word2Vec model ourselves, but keep in mind that we also want embeddings for the words that appear in the test set. So let’s first apply the familiar preprocessing steps to the test set:

# Reformat the test set.
df_test_reformat = reformat_data(df_test)

# NLTK preprocessing.
df_test_with_tokens = answer_tokenize_and_lemmatize(df_test_reformat)

# spaCy preprocessing.
df_test_with_nltk_tokens_and_spacy_tokens = add_spacy_tokens(add_spacy_doc(df_test_with_tokens))

display(df_test_with_nltk_tokens_and_spacy_tokens)
class_idx class text tokens doc spacy_tokens
0 3 Business Fears for T N pension after talks Unions repre... [Fears, for, T, N, pension, after, talk, Union... (Fears, for, T, N, pension, after, talks, Unio... [fear, T, N, pension, talk, Unions, represent,...
1 4 Sci/Tech The Race is On: Second Private Team Sets Launc... [The, Race, be, On, :, Second, Private, Team, ... (The, Race, is, On, :, Second, Private, Team, ... [Race, second, private, team, set, Launch, Dat...
2 4 Sci/Tech Ky. Company Wins Grant to Study Peptides (AP) ... [Ky., Company, Wins, Grant, to, Study, Peptide... (Ky., Company, Wins, Grant, to, Study, Peptide... [Company, win, Grant, study, Peptides, AP, AP,...
3 4 Sci/Tech Prediction Unit Helps Forecast Wildfires (AP) ... [Prediction, Unit, Helps, Forecast, Wildfires,... (Prediction, Unit, Helps, Forecast, Wildfires,... [prediction, Unit, help, Forecast, Wildfires, ...
4 4 Sci/Tech Calif. Aims to Limit Farm-Related Smog (AP) AP... [Calif, ., Aims, to, Limit, Farm-Related, Smog... (Calif., Aims, to, Limit, Farm, -, Related, Sm... [aim, limit, Farm, Related, Smog, AP, AP, Sout...
... ... ... ... ... ... ...
473 3 Business New Overtime Rules Take Effect New Bush admini... [New, Overtime, Rules, Take, Effect, New, Bush... (New, Overtime, Rules, Take, Effect, New, Bush... [New, Overtime, Rules, Effect, New, Bush, admi...
479 3 Business Dollar Holds Gains, Fed Comments Help TOKYO (... [Dollar, Holds, Gains, ,, Fed, Comments, Help,... (Dollar, Holds, Gains, ,, Fed, Comments, Help,... [Dollar, hold, Gains, Fed, Comments, help, TOK...
481 3 Business Dark arts of spin evident in phoney war for Ab... [Dark, art, of, spin, evident, in, phoney, war... (Dark, arts, of, spin, evident, in, phoney, wa... [dark, art, spin, evident, phoney, war, Abbey,...
482 3 Business Controversial US Overtime Rules Take Effect Ne... [Controversial, US, Overtime, Rules, Take, Eff... (Controversial, US, Overtime, Rules, Take, Eff... [controversial, Overtime, Rules, Effect, new, ...
484 3 Business SAS Braathens to cut Gatwick, Geneva flights b... [SAS, Braathens, to, cut, Gatwick, ,, Geneva, ... (SAS, Braathens, to, cut, Gatwick, ,, Geneva, ... [SAS, Braathens, cut, Gatwick, Geneva, flight,...

400 rows × 6 columns

To train the Word2Vec model on the full vocabulary, we combine the tokens columns of the training and test sets into one series and pass it to the Word2Vec constructor.

# Rename the very long variables.
df_train_preprocessed = df_with_nltk_tokens_and_spacy_tokens
df_test_preprocessed = df_test_with_nltk_tokens_and_spacy_tokens

# Get all tokens into one series.
tokens_both = pd.concat([df_train_preprocessed["tokens"], df_test_preprocessed["tokens"]])

# Train a Word2Vec model on the NLTK tokens.
w2v_model = Word2Vec(tokens_both.values, vector_size=40, min_count=1)

To obtain the embedding of a single word, we can use the Word2Vec.wv[word] syntax. To stack multiple vectors next to each other in a 2D matrix, we can call numpy.vstack.

print(np.vstack([w2v_model.wv[word] for word in ["rain", "cat", "dog"]]))
[[-2.76309401e-01 -7.75136873e-02  3.03611517e-01  2.10117131e-01
   7.57407770e-02 -6.52443692e-02 -8.24972093e-02  7.95511529e-02
  -2.73732506e-02  3.29312652e-01 -1.14486247e-01 -4.21194583e-01
  -6.78604618e-02  3.98109434e-03  3.59004796e-01  1.16996072e-01
   3.15445438e-02  2.08379328e-01  5.15416116e-02 -2.08713517e-01
  -3.04057449e-01 -6.45264983e-02  3.26634556e-01 -9.80910361e-02
  -3.32489789e-01 -3.82966809e-02 -9.86100435e-02  5.41328311e-01
  -3.95968288e-01 -2.45931149e-01  3.30803782e-01  1.48946956e-01
   2.03940153e-01 -9.69057679e-02  1.58692021e-02  1.50189862e-01
   7.09178805e-01 -2.64038652e-01 -4.34074432e-01 -2.68073887e-01]
 [ 4.56083333e-03  1.31083094e-02  7.98655115e-03 -2.63949041e-04
  -4.30374034e-02  1.06827244e-02 -1.94894290e-03  1.58272684e-02
   2.59405803e-02  2.87713557e-02  1.35446666e-03 -4.18917127e-02
  -3.02243214e-02 -8.29823501e-03  3.19179110e-02  5.47345169e-03
   3.09204757e-02 -1.22328512e-02 -3.03305164e-02 -2.89320890e-02
  -4.30800114e-03 -2.20099892e-02  4.92604710e-02 -1.15064194e-03
  -3.02943047e-02  1.91085767e-02 -2.83540934e-02  6.58964068e-02
  -2.68006418e-02 -3.31034139e-03  3.38040404e-02  2.95044240e-02
   4.62594591e-02 -3.08212116e-02 -2.07123328e-02  1.67044569e-02
   1.26870573e-01  2.46338472e-02 -3.80816832e-02  1.53579945e-02]
 [-9.91130546e-02  8.52283277e-03  8.49402323e-02  6.22117147e-02
   4.71997540e-03  3.20366817e-03 -3.65188904e-02  4.02424522e-02
   5.79491816e-03  1.54158622e-01 -1.50499800e-02 -1.59003824e-01
  -2.53102332e-02 -9.40567069e-03  1.14784084e-01  6.87591881e-02
   4.83738221e-02  2.07013451e-02 -3.70337218e-02 -8.76008272e-02
  -1.10275254e-01 -8.97379685e-03  1.38567820e-01 -4.64034118e-02
  -9.26503465e-02 -1.24102989e-02 -7.68525153e-02  1.83954731e-01
  -1.21105552e-01 -6.01319149e-02  1.25546858e-01  5.74087873e-02
   6.94805682e-02 -1.52010396e-02 -4.88784648e-02  4.25413921e-02
   3.23110342e-01 -3.86326574e-02 -1.77580908e-01 -7.31595308e-02]]
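
As a quick, optional sanity check of the learned embeddings, gensim's wv.most_similar returns the words whose vectors are closest (by cosine similarity) to a given word; the exact neighbours will vary between runs and depend on the small corpus we trained on.

# Inspect the nearest neighbours of a word in the embedding space.
print(w2v_model.wv.most_similar("dog", topn=5))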

The spaCy model we used has a Tok2Vec component in its pipeline, so we can directly access the 2D matrix of word vectors for a document through the Doc.tensor attribute. Keep in mind that this matrix still contains the embeddings of the stopwords.

print(doc.tensor)
[[ 0.8194096  -0.32325384  0.629434   ... -0.4773795  -0.75188184
   0.0357812 ]
 [ 0.81603235 -1.2405076   0.9558864  ...  0.52738035 -1.0743449
  -0.30024663]
 [ 1.1372288  -1.0574455  -0.20238371 ...  0.38068694 -0.03450352
   0.540362  ]
 ...
 [ 1.2355547  -0.6400209  -0.59921527 ... -0.12730186 -0.3426052
  -1.1101209 ]
 [ 0.19090153 -0.6523549  -0.4373727  ...  0.8468437   0.49040866
  -0.14062503]
 [-0.5942377  -0.93374133  0.54625034 ... -0.05576921 -0.9447482
  -1.1440233 ]]

To prepare the word embeddings for classification, we will add a tensor column to both the training and test dataframes. Each cell in the tensor column should hold a 2D array with one word embedding vector per token in the corresponding row's text. The tensors need to have the same shape for both the training and test sets, so we pad the shorter tensors by adding rows of zeros at the end.

def add_padded_tensors(df1, df2):
    """
    Add a tensor column to the dataframes, with every tensor having the same dimensions.

    Parameters
    ----------
    df1 : pandas.DataFrame
        The first dataframe containing at least the "tokens" or "doc" column.
    df2 : pandas.DataFrame
        The second dataframe containing at least the "tokens" or "doc" column.

    Returns
    -------
    tuple[pandas.DataFrame]
        The dataframes with the added tensor column.
    """
    # Copy the dataframes to avoid editing the originals.
    df1 = df1.copy(deep=True)
    df2 = df2.copy(deep=True)

    # Add tensors (option 1: using the Word2Vec model that we created).
    #for df in [df1, df2]:
    #    df["tensor"] = df["tokens"].apply(
    #        lambda tokens: np.vstack([w2v_model.wv[token] for token in tokens])
    #    )

    # Add tensors (option 2: using the spaCy tensors).
    for df in [df1, df2]:
        df["tensor"] = df["doc"].apply(lambda doc: doc.tensor)

    # Determine the largest number of token rows across the training and test sets.
    largest = max(df1["tensor"].apply(lambda x: x.shape[0]).max(),
                  df2["tensor"].apply(lambda x: x.shape[0]).max())

    # Pad the tensors with zeros so that they have equal size.
    for df in [df1, df2]:
        df["tensor"] = df["tensor"].apply(
            lambda x: np.pad(x, ((0, largest - x.shape[0]), (0, 0)))
        )

    return df1, df2

df_train_with_tensor, df_test_with_tensor = add_padded_tensors(df_train_preprocessed, df_test_preprocessed)
display(df_test_with_tensor)
class_idx class text tokens doc spacy_tokens tensor
0 3 Business Fears for T N pension after talks Unions repre... [Fears, for, T, N, pension, after, talk, Union... (Fears, for, T, N, pension, after, talks, Unio... [fear, T, N, pension, talk, Unions, represent,... [[-1.8966789, 1.5618968, -0.4389819, 0.8901441...
1 4 Sci/Tech The Race is On: Second Private Team Sets Launc... [The, Race, be, On, :, Second, Private, Team, ... (The, Race, is, On, :, Second, Private, Team, ... [Race, second, private, team, set, Launch, Dat... [[1.0206602, -0.77784187, -0.07053307, 1.81432...
2 4 Sci/Tech Ky. Company Wins Grant to Study Peptides (AP) ... [Ky., Company, Wins, Grant, to, Study, Peptide... (Ky., Company, Wins, Grant, to, Study, Peptide... [Company, win, Grant, study, Peptides, AP, AP,... [[0.07833788, -0.781985, -0.4735602, 1.2093124...
3 4 Sci/Tech Prediction Unit Helps Forecast Wildfires (AP) ... [Prediction, Unit, Helps, Forecast, Wildfires,... (Prediction, Unit, Helps, Forecast, Wildfires,... [prediction, Unit, help, Forecast, Wildfires, ... [[-0.17650917, -0.8486336, 0.23646504, 0.21480...
4 4 Sci/Tech Calif. Aims to Limit Farm-Related Smog (AP) AP... [Calif, ., Aims, to, Limit, Farm-Related, Smog... (Calif., Aims, to, Limit, Farm, -, Related, Sm... [aim, limit, Farm, Related, Smog, AP, AP, Sout... [[0.22340375, -0.8782145, -0.6919652, -0.46230...
... ... ... ... ... ... ... ...
473 3 Business New Overtime Rules Take Effect New Bush admini... [New, Overtime, Rules, Take, Effect, New, Bush... (New, Overtime, Rules, Take, Effect, New, Bush... [New, Overtime, Rules, Effect, New, Bush, admi... [[0.020376865, -0.78690255, 0.89715064, 0.7673...
479 3 Business Dollar Holds Gains, Fed Comments Help TOKYO (... [Dollar, Holds, Gains, ,, Fed, Comments, Help,... (Dollar, Holds, Gains, ,, Fed, Comments, Help,... [Dollar, hold, Gains, Fed, Comments, help, TOK... [[-1.4306974, -1.2274694, 1.5035357, -0.057243...
481 3 Business Dark arts of spin evident in phoney war for Ab... [Dark, art, of, spin, evident, in, phoney, war... (Dark, arts, of, spin, evident, in, phoney, wa... [dark, art, spin, evident, phoney, war, Abbey,... [[-0.59349465, -1.1670003, -0.48161936, 0.4895...
482 3 Business Controversial US Overtime Rules Take Effect Ne... [Controversial, US, Overtime, Rules, Take, Eff... (Controversial, US, Overtime, Rules, Take, Eff... [controversial, Overtime, Rules, Effect, new, ... [[-0.32088518, -0.23858789, -0.5674784, 1.7897...
484 3 Business SAS Braathens to cut Gatwick, Geneva flights b... [SAS, Braathens, to, cut, Gatwick, ,, Geneva, ... (SAS, Braathens, to, cut, Gatwick, ,, Geneva, ... [SAS, Braathens, cut, Gatwick, Geneva, flight,... [[-1.0833734, -1.1490844, 0.7710656, 0.3060569...

400 rows × 7 columns
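
To verify that the padding worked, we can check that every tensor now has the same shape; the quick check below is not part of the pipeline itself, just a sanity test on the dataframes we created above.

# Every tensor should now have the same (number of tokens, embedding size) shape.
print(df_train_with_tensor["tensor"].iloc[0].shape)
print(df_test_with_tensor["tensor"].iloc[0].shape)

# The number of distinct shapes across both dataframes should be 1.
all_shapes = pd.concat([df_train_with_tensor["tensor"], df_test_with_tensor["tensor"]]).apply(lambda x: x.shape)
print(all_shapes.nunique())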

Build the Classifier#

Our neural network will take the embedding representation of a document as input and predict its topic through a softmax output layer. We will then evaluate the performance of the model, starting with a confusion matrix; computing further metrics such as accuracy, precision, recall, and F1-score is left as an optional exercise at the end.

The following code demonstrates how to implement a neural network for topic classification in PyTorch. First let’s do some more preparations for our inputs, turning them into PyTorch tensors.

# Transform spaCy tensors into PyTorch tensors.
input_train = torch.from_numpy(np.stack(df_train_with_tensor["tensor"]))
input_test = torch.from_numpy(np.stack(df_test_with_tensor["tensor"]))

# Get the labels, move to 0-indexed instead of 1-indexed.
train_labels = torch.from_numpy(df_train_with_tensor["class_idx"].values) - 1
test_labels = torch.from_numpy(df_test_with_tensor["class_idx"].values) - 1

# One-hot encode labels for training.
train_target = torch.zeros((len(train_labels), 4))
train_target = train_target.scatter_(1, train_labels.unsqueeze(1), 1).unsqueeze(1)
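
If the scatter_ call looks opaque, the following toy example (with made-up labels) shows what it does: it writes a 1 into each row at the column given by that row's label, producing a one-hot matrix.

# Toy example: one-hot encode three labels (classes 2, 0, and 3) into a (3, 4) matrix.
toy_labels = torch.tensor([2, 0, 3])
toy_onehot = torch.zeros((3, 4)).scatter_(1, toy_labels.unsqueeze(1), 1)
print(toy_onehot)
# tensor([[0., 0., 1., 0.],
#         [1., 0., 0., 0.],
#         [0., 0., 0., 1.]])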

Then, it is time to define our network. The neural net consists of three fully connected layers (fc1, fc2, and fc3) with ReLU activation (relu) in between each layer. We flatten the input tensor using view before passing it through the fully connected layers. Finally, we apply the softmax activation function (softmax) to the output tensor to obtain the predicted probabilities for each class.

class TopicClassifier(nn.Module):
    def __init__(self, input_width, input_length, output_size):
        super(TopicClassifier, self).__init__()
        self.input_width = input_width
        self.input_length = input_length
        self.output_size = output_size

        self.fc1 = nn.Linear(input_width * input_length, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, output_size)

        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        # Flatten the input tensor.
        x = x.view(-1, self.input_width * self.input_length)

        # Pass through the fully connected layers with ReLU activation.
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.relu(x)
        x = self.fc3(x)

        # Apply softmax activation to the output.
        x = self.softmax(x)
        return x
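
To see the shapes involved, here is a quick sanity check with random data shaped like one padded document; the variable names are hypothetical and the model instance is thrown away afterwards. The output is a single row of four class probabilities that sum to one because of the softmax.

# Hypothetical sanity check with random data of the same shape as one padded document.
dummy_width, dummy_length = input_train.shape[1], input_train.shape[2]
dummy_model = TopicClassifier(dummy_width, dummy_length, output_size=4)
dummy_input = torch.randn(dummy_width, dummy_length)

dummy_output = dummy_model(dummy_input)
print(dummy_output.shape)         # torch.Size([1, 4])
print(dummy_output.sum().item())  # close to 1.0 because of the softmax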

Now it’s time to train our network. This may take a while, but the current loss will be printed after every epoch. If you want the code to run faster, you can also put this notebook on Google Colab and use its provided GPU to speed up the computation.

# Define parameters.
n_classes = len(train_labels.unique())
input_size = input_train.shape[1:]
num_epochs = 5
lr = 0.001

# Define model, loss function and optimizer.
model = TopicClassifier(*input_size, output_size=n_classes)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=lr)

# Training loop.
for epoch in range(num_epochs):
    for i, (inputs, labels) in enumerate(zip(input_train, train_target)):
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
    print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}")
Epoch [1/5], Loss: 1.3814
Epoch [2/5], Loss: 1.2614
Epoch [3/5], Loss: 0.8655
Epoch [4/5], Loss: 0.7708
Epoch [5/5], Loss: 0.7529
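
The loop above processes one document at a time. One common way to speed up training (besides using a GPU) is to iterate over mini-batches instead; the sketch below is one possible way to do this with torch.utils.data.TensorDataset and DataLoader, reusing the tensors defined above. It is an illustration under the assumption of a reasonably recent PyTorch version (CrossEntropyLoss with soft one-hot targets, which the per-sample loop above also relies on), not part of the tutorial's reference solution.

from torch.utils.data import DataLoader, TensorDataset

# Build a dataset from the tensors created above; squeeze the extra singleton
# dimension from train_target so that each batch of labels has shape (batch, 4).
dataset = TensorDataset(input_train, train_target.squeeze(1))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

# A fresh model and optimizer for this sketch, so the model trained above stays untouched.
model_batched = TopicClassifier(*input_size, output_size=n_classes)
optimizer_batched = torch.optim.SGD(model_batched.parameters(), lr=lr)

for epoch in range(num_epochs):
    for inputs, labels in loader:
        optimizer_batched.zero_grad()
        outputs = model_batched(inputs)    # shape: (batch, n_classes)
        loss = criterion(outputs, labels)  # soft (one-hot) targets, as in the loop above
        loss.backward()
        optimizer_batched.step()
    print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}")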

Optional Assignment for Task 5#

The following code evaluates the model using a confusion matrix.

# Evaluate the neural net on the test set.
model.eval()

# Run the model on the test set.
with torch.no_grad():
    test_outputs = model(input_test)
    # Take the label with the highest predicted probability for each sample.
    test_pred = np.argmax(test_outputs.detach(), axis=1)

# Set model back to training mode.
model.train()

# Compute the confusion matrix.
cm = confusion_matrix(test_labels, test_pred)

# Plot the confusion matrix as a heatmap using seaborn.
labels = ["World", "Sports", "Business", "Sci/Tech"]
h = sns.heatmap(cm, annot=True, cmap="Blues", fmt="g", xticklabels=labels, yticklabels=labels)
ax = plt.xlabel("Predicted Labels")
ax = plt.ylabel("True Labels")
(Figure: confusion matrix heatmap comparing true and predicted labels for the four news categories.)

If you do not feel done with text data yet, there is always more to do. In this optional assignment, you can experiment with the number of epochs, the learning rate, the vector size, the optimizer, the neural network architecture, regularization, and so on. Also, we only use a small subset of this dataset for performance reasons; if you have a high-end computer, you can go back to the beginning of this tutorial and increase the size of the subset.

Even during preprocessing, we could have done some things differently, such as lowercasing everything and removing punctuation. Be aware that every choice you make along the way trickles down through your pipeline and can affect your results. Also, take the time to write code that evaluates the model with more metrics, such as accuracy, precision, recall, and the F1 score; a small starting point is sketched below.
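
As a starting point for that extra evaluation, scikit-learn's accuracy_score and classification_report can compute the overall accuracy and the per-class precision, recall, and F1 score from the predictions we already made above (test_labels and test_pred); this is a minimal sketch, not a full evaluation.

from sklearn.metrics import accuracy_score, classification_report

# Overall accuracy and per-class precision, recall, and F1 on the test set.
print("Accuracy:", accuracy_score(test_labels, test_pred))
print(classification_report(test_labels, test_pred, target_names=labels))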


1. Credit: this teaching material was created by Robert van Straten and revised by Alejandro Monroy under the supervision of Yen-Chia Hsu.