Tutorial (Text Data Processing)#

(Last updated: Jan 29, 2025)1

This tutorial will familiarize you with the data science pipeline of processing text data. We will go through the various steps involved in the Natural Language Processing (NLP) pipeline for topic modelling and topic classification, including tokenization, lemmatization, and obtaining word embeddings. We will also build a neural network using PyTorch for multi-class topic classification using the dataset.

The AG’s News Topic Classification Dataset contains news articles from four different categories, making it a nice source of text data for NLP tasks. We will guide you through the process of understanding the dataset, implementing various NLP techniques, and building a model for classification.

You can use the following links to jump to the tasks and assignments:

Scenario#

The AG’s News Topic Classification Dataset is a collection of over 1 million news articles from more than 2000 news sources. The dataset was created by selecting the 4 largest classes from the original corpus, resulting in 120,000 training samples and 7,600 testing samples. The dataset is provided by the academic community for research purposes in data mining, information retrieval, and other non-commercial activities. We will use it to demonstrate various NLP techniques on real data and, in the end, build two models with this data. The files train.csv and test.csv contain all the training and testing samples as comma-separated values with 3 columns: class index, title, and description. Download train.csv and test.csv for the following tasks.

Import Packages#

Important

To make this notebook work, you need to install the packages by following the instructions in the preparation step. You can also copy this notebook (as well as the dataset) to Google Colab and run the notebook on it.

We put all the packages that are needed for this tutorial below:

import nltk
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

import spacy

import torch
import torch.nn as nn
import torch.optim as optim

from gensim.models import Word2Vec

from nltk.corpus import (
    stopwords,
    wordnet
)
from nltk.stem import (
    SnowballStemmer,
    WordNetLemmatizer
)
from nltk.tokenize import word_tokenize

from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import (
    adjusted_mutual_info_score,
    adjusted_rand_score,
    confusion_matrix
)

from tqdm.notebook import tqdm

from wordcloud import WordCloud

# Import the answers for the tasks
from util.answer import (
    answer_tokenize_and_lemmatize,
    answer_get_word_counts,
    answer_remove_stopwords,
    answer_get_index_of_top_n_items
)

# Import the utility functions that are provided
from util.util import (
    check_answer_df,
    check_answer_np,
    wordnet_pos,
    reformat_data,
    visualize_word_counts,
    add_spacy_doc,
    add_spacy_tokens,
    add_padded_tensors
)

# Add tqdm functions to pandas.
tqdm.pandas()

Task Answers#

Click on one of the following links to check answers for the assignments in this tutorial. Do not check the answers before practicing the tasks.

Utility File#

Click on one of the following links to check the provided functions in the utility file for this tutorial.

Task 3: Preprocess Text Data#

In this task, we will preprocess the text data from the AG News Dataset. First, we need to load the files.

df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

For performance reasons, we will only use a small subset of this dataset.

df_train = df_train.groupby("Class Index").head(1000)
df_test = df_test.groupby("Class Index").head(100)
display(df_train, df_test)
Class Index Title Description
0 3 Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindli...
1 3 Carlyle Looks Toward Commercial Aerospace (Reu... Reuters - Private investment firm Carlyle Grou...
2 3 Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\ab...
3 3 Iraq Halts Oil Exports from Main Southern Pipe... Reuters - Authorities have halted oil export\f...
4 3 Oil prices soar to all-time record, posing new... AFP - Tearaway world oil prices, toppling reco...
... ... ... ...
4865 2 IOA sets up committee to probe dope scandal Athens, Aug 20. (PTI):In a belated damage-cont...
4866 2 FACTBOX-Jonathan Woodgate factbox MADRID, Aug 20 (Reuters) - Factbox on England ...
4867 2 British canoe pair lose out ATHENS (Reuters) - Slovakian twins Peter and P...
4895 2 U.S. Softball Team Posts Shutout No. 7 (AP) AP - Cat Osterman struck out 10 in six innings...
4896 2 Paul Hamm's example PAUL HAMM'S fall and rise are what make the Ol...

4000 rows × 3 columns

Class Index Title Description
0 3 Fears for T N pension after talks Unions representing workers at Turner Newall...
1 4 The Race is On: Second Private Team Sets Launc... SPACE.com - TORONTO, Canada -- A second\team o...
2 4 Ky. Company Wins Grant to Study Peptides (AP) AP - A company founded by a chemistry research...
3 4 Prediction Unit Helps Forecast Wildfires (AP) AP - It's barely dawn when Mike Fitzpatrick st...
4 4 Calif. Aims to Limit Farm-Related Smog (AP) AP - Southern California's smog-fighting agenc...
... ... ... ...
473 3 New Overtime Rules Take Effect New Bush administration rules that scale back ...
479 3 Dollar Holds Gains, Fed Comments Help TOKYO (Reuters) - The dollar held on to the p...
481 3 Dark arts of spin evident in phoney war for Abbey THE phoney war over the fate of Abbey grinds o...
482 3 Controversial US Overtime Rules Take Effect New overtime rules have taken effect in the Un...
484 3 SAS Braathens to cut Gatwick, Geneva flights blackhawk writes quot;SAS Braathens, the Norw...

400 rows × 3 columns

As you can see, all the classes are distributed evenly in the train and test data.

display(df_train["Class Index"].value_counts(), df_test["Class Index"].value_counts())
Class Index
3    1000
4    1000
2    1000
1    1000
Name: count, dtype: int64
Class Index
3    100
4    100
2    100
1    100
Name: count, dtype: int64

To make the data easier to understand, we will add a class column derived from the original Class Index column, containing the category of the news article. To process both the title and news text together, we will combine the Title and Description columns into one text column. We will deal with just the train data until the point where we need the test data again. We provide the reformat_data function in the utility file for this; a rough sketch of what such a function might look like is shown below.
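
The sketch below maps the class indexes to category names and combines the Title and Description columns; the mapping (1 = World, 2 = Sports, 3 = Business, 4 = Sci/Tech) follows the AG's News convention used later in this tutorial. This is only an illustration, and the provided reformat_data in util/util.py may differ in details.

# A rough sketch of the reformatting idea (illustration only; the provided
# reformat_data function in util/util.py may differ in details).
# AG's News class indexes: 1 = World, 2 = Sports, 3 = Business, 4 = Sci/Tech.
def reformat_data_sketch(df):
    class_names = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}
    return pd.DataFrame({
        "class_idx": df["Class Index"],
        "class": df["Class Index"].map(class_names),
        "text": df["Title"] + " " + df["Description"],
    })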

df_train_reformat = reformat_data(df_train)
display(df_train_reformat)
class_idx class text
0 3 Business Wall St. Bears Claw Back Into the Black (Reute...
1 3 Business Carlyle Looks Toward Commercial Aerospace (Reu...
2 3 Business Oil and Economy Cloud Stocks' Outlook (Reuters...
3 3 Business Iraq Halts Oil Exports from Main Southern Pipe...
4 3 Business Oil prices soar to all-time record, posing new...
... ... ... ...
4865 2 Sports IOA sets up committee to probe dope scandal At...
4866 2 Sports FACTBOX-Jonathan Woodgate factbox MADRID, Aug ...
4867 2 Sports British canoe pair lose out ATHENS (Reuters) -...
4895 2 Sports U.S. Softball Team Posts Shutout No. 7 (AP) AP...
4896 2 Sports Paul Hamm's example PAUL HAMM'S fall and rise ...

4000 rows × 3 columns

Tokenization#

Tokenization is the process of breaking down a text into individual tokens, which are usually words but can also be phrases or sentences. It helps language models to understand and analyze text data by breaking it down into smaller, more manageable pieces. While it may seem trivial, tokenization can be done in multiple ways and can therefore be a complex and challenging step that influences downstream NLP applications.

For example, in languages like English, it is generally straightforward to identify words by using spaces as delimiters. However, there are exceptions, such as contractions like “can’t” and hyphenated words like “self-driving”. In Dutch, where multiple nouns can be combined into one larger compound noun without any delimiter, this can be harder. How would you tokenize “hippopotomonstrosesquippedaliofobie”? In other languages, such as Chinese and Japanese, there are no spaces between words, so identifying word boundaries is much more difficult.

To illustrate the use of tokenization, let’s consider the following example, which tokenizes a sample text using the word_tokenize function from the NLTK package. That function uses a pre-trained tokenization model for English.

# Sample text.
text = "The quick brown fox jumped over the lazy dog. The cats couldn't wait to sleep all day."

# Tokenize the text.
tokens = word_tokenize(text)

# Print the text and the tokens.
print("Original text:", text)
print("Tokenized text:", tokens)
Original text: The quick brown fox jumped over the lazy dog. The cats couldn't wait to sleep all day.
Tokenized text: ['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', '.', 'The', 'cats', 'could', "n't", 'wait', 'to', 'sleep', 'all', 'day', '.']

Part-of-speech tagging#

Part-of-speech (POS) tagging is the process of assigning each word in a text corpus with a specific part-of-speech tag based on its context and definition. The tags typically include nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, interjections, and more. POS tagging can help other NLP tasks disambiguate a token somewhat due to the added context.

pos_tags = nltk.pos_tag(tokens)
print(pos_tags)
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumped', 'VBD'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.'), ('The', 'DT'), ('cats', 'NNS'), ('could', 'MD'), ("n't", 'RB'), ('wait', 'VB'), ('to', 'TO'), ('sleep', 'VB'), ('all', 'DT'), ('day', 'NN'), ('.', '.')]

Stemming / Lemmatization#

Stemming and lemmatization are two common techniques used in NLP to preprocess and normalize text data. Both techniques involve transforming words into their root form, but they differ in their approach and the level of normalization they provide.

Stemming is a technique that involves reducing words to their base or stem form by removing any affixes or suffixes. For example, the stem of the word “lazily” would be “lazi”. Stemming is a simple and fast technique that can be useful. However, it can also produce inaccurate or incorrect results since it does not consider the context or part of speech of the word.

Lemmatization, on the other hand, is a more sophisticated technique that involves identifying the base or dictionary form of a word, also known as the lemma. Unlike stemming, lemmatization can consider the part of speech (POS) of the word, which can make it more accurate and reliable. With lemmatization, the lemma of the word “lazily” would be “lazy”. Lemmatization can be slower and more complex than stemming but provides a higher level of normalization.

Also, lemmatization requires POS tagging. Since the POS tag set used by the nltk package is different from the one used by the wordnet package, we provide a function wordnet_pos in the utility file for converting the POS tags. The input of the wordnet_pos function is an nltk POS tag, and the output of the function is a wordnet POS tag. A sketch of what such a mapping typically looks like is shown below.
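
Typically, the first letter of the NLTK (Penn Treebank) tag determines the WordNet POS constant. The hypothetical function below only illustrates this idea and is not necessarily how the provided wordnet_pos is implemented.

# A plausible sketch of an NLTK-to-WordNet POS tag mapping (illustration only;
# the provided wordnet_pos function in util/util.py may differ).
def nltk_to_wordnet_pos_sketch(nltk_tag):
    if nltk_tag.startswith("J"):
        return wordnet.ADJ
    elif nltk_tag.startswith("V"):
        return wordnet.VERB
    elif nltk_tag.startswith("R"):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # a reasonable default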

# Initialize the stemmer and lemmatizer.
stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

# Perform stemming and lemmatization separately on the tokens.
stemmed_tokens = [stemmer.stem(token) for token in tokens]
lemmatized_tokens = [lemmatizer.lemmatize(token, wordnet_pos(tag))
                     for token, tag in nltk.pos_tag(tokens)]

# Print the results.
print("Stemmed text:", stemmed_tokens)
print("Lemmatized text:", lemmatized_tokens)
Stemmed text: ['the', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazi', 'dog', '.', 'the', 'cat', 'could', "n't", 'wait', 'to', 'sleep', 'all', 'day', '.']
Lemmatized text: ['The', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog', '.', 'The', 'cat', 'could', "n't", 'wait', 'to', 'sleep', 'all', 'day', '.']

Stopword Removal#

Stopword removal is a common technique used in NLP to preprocess and clean text data by removing words that are considered to be of little or no value in terms of conveying meaning or information. These words are called “stopwords” and they include common words such as “the”, “a”, “an”, “and”, “or”, “but”, and so on.

The purpose of stopword removal in NLP is to improve the accuracy and efficiency of text analysis and processing by reducing the noise and complexity of the data. Stopwords are often used to form grammatical structures in a sentence, but they do not carry much meaning or relevance to the main topic or theme of the text. So by removing these words, we can reduce the dimensionality of the text data, improve the performance of machine learning models, and speed up the processing of text data. NLTK has a predefined list of stopwords for English.

# English stopwords in NLTK.
stopwords_list = stopwords.words('english')
print(stopwords_list)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

Assignment for Task 3.1: Tokenization and Lemmatization#

The first step is to tokenize and lemmatize the sentences. Your task (which is your assignment) is to write a function that does the following:

  • Since we want to use our text to make a model later on, we need to preprocess it. Add a tokens column to the df_train dataframe with the text tokenized, then lemmatize those tokens. You must use the POS tags when lemmatizing.

    • Hint: Use the pandas.Series.apply function with the imported nltk.tokenize.word_tokenize function. Recall that you can use the pd.Series.apply? syntax in a code cell for more information.

    • Hint: use the nltk.stem.WordNetLemmatizer.lemmatize function to lemmatize a token. Use the wordnet_pos function to obtain the POS tag for the lemmatizer.

  • Tokenizing and lemmatizing the entire dataset can take a while too. Use tqdm and the pandas.Series.progress_apply function to show progress bars for the operations. A minimal sketch of this pattern on a toy dataframe is shown below.
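
To show what the progress_apply pattern looks like without giving away the assignment, here is a minimal sketch on a made-up toy dataframe (the toy_df name and its contents are invented for illustration).

# A minimal sketch of the progress_apply pattern on a toy dataframe
# (illustration only, not the assignment solution).
toy_df = pd.DataFrame({"text": ["The fox jumped.", "The cats slept."]})
toy_df["tokens"] = toy_df["text"].progress_apply(word_tokenize)
print(toy_df)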

def tokenize_and_lemmatize(df):
    """
    Tokenize and lemmatize the text in the dataset.

    Parameters
    ----------
    df : pandas.DataFrame
        The dataframe containing at least the "text" column.

    Returns
    -------
    pandas.DataFrame
        The dataframe with the added "tokens" column.
    """
    ###################################
    # Fill in your answer here
    return None
    ###################################

Our goal is to have a dataframe that looks like the following. For simplicity, we only show the top 5 most frequent words. Your data frame should have more rows.

# This part of code will take some time to run.
answer_df_with_tokens = answer_tokenize_and_lemmatize(df_train_reformat)
answer_df_with_tokens.groupby("class").head(n=5)
class_idx class text tokens
0 3 Business Wall St. Bears Claw Back Into the Black (Reute... [Wall, St., Bears, Claw, Back, Into, the, Blac...
1 3 Business Carlyle Looks Toward Commercial Aerospace (Reu... [Carlyle, Looks, Toward, Commercial, Aerospace...
2 3 Business Oil and Economy Cloud Stocks' Outlook (Reuters... [Oil, and, Economy, Cloud, Stocks, ', Outlook,...
3 3 Business Iraq Halts Oil Exports from Main Southern Pipe... [Iraq, Halts, Oil, Exports, from, Main, Southe...
4 3 Business Oil prices soar to all-time record, posing new... [Oil, price, soar, to, all-time, record, ,, po...
78 4 Sci/Tech 'Madden,' 'ESPN' Football Score in Different W... ['Madden, ,, ', 'ESPN, ', Football, Score, in,...
79 4 Sci/Tech Group to Propose New High-Speed Wireless Forma... [Group, to, Propose, New, High-Speed, Wireless...
80 4 Sci/Tech AOL to Sell Cheap PCs to Minorities and Senior... [AOL, to, Sell, Cheap, PCs, to, Minorities, an...
81 4 Sci/Tech Companies Approve New High-Capacity Disc Forma... [Companies, Approve, New, High-Capacity, Disc,...
82 4 Sci/Tech Missing June Deals Slow to Return for Software... [Missing, June, Deals, Slow, to, Return, for, ...
448 2 Sports Phelps, Thorpe Advance in 200 Freestyle (AP) A... [Phelps, ,, Thorpe, Advance, in, 200, Freestyl...
449 2 Sports Reds Knock Padres Out of Wild-Card Lead (AP) A... [Reds, Knock, Padres, Out, of, Wild-Card, Lead...
450 2 Sports Dreaming done, NBA stars awaken to harsh Olymp... [Dreaming, do, ,, NBA, star, awaken, to, harsh...
451 2 Sports Indians Beat Twins 7-1, Nearing AL Lead (AP) A... [Indians, Beat, Twins, 7-1, ,, Nearing, AL, Le...
452 2 Sports Galaxy, Crew Play to 0-0 Tie (AP) AP - Kevin H... [Galaxy, ,, Crew, Play, to, 0-0, Tie, (, AP, )...
492 1 World Venezuelans Vote Early in Referendum on Chavez... [Venezuelans, Vote, Early, in, Referendum, on,...
493 1 World S.Koreans Clash with Police on Iraq Troop Disp... [S.Koreans, Clash, with, Police, on, Iraq, Tro...
494 1 World Palestinians in Israeli Jails Start Hunger Str... [Palestinians, in, Israeli, Jails, Start, Hung...
495 1 World Seven Georgian soldiers wounded as South Osset... [Seven, Georgian, soldier, wound, a, South, Os...
496 1 World Rwandan Troops Arrive in Darfur (AP) AP - Doze... [Rwandan, Troops, Arrive, in, Darfur, (, AP, )...

The code below tests if the output of your function matches the expected output.

df_with_tokens = tokenize_and_lemmatize(df_train_reformat)
check_answer_df(df_with_tokens, answer_df_with_tokens)
Test case 1 failed.

Your output is:
None

Expected output is:
      class_idx     class                                               text  \
0             3  Business  Wall St. Bears Claw Back Into the Black (Reute...   
1             3  Business  Carlyle Looks Toward Commercial Aerospace (Reu...   
2             3  Business  Oil and Economy Cloud Stocks' Outlook (Reuters...   
3             3  Business  Iraq Halts Oil Exports from Main Southern Pipe...   
4             3  Business  Oil prices soar to all-time record, posing new...   
...         ...       ...                                                ...   
4865          2    Sports  IOA sets up committee to probe dope scandal At...   
4866          2    Sports  FACTBOX-Jonathan Woodgate factbox MADRID, Aug ...   
4867          2    Sports  British canoe pair lose out ATHENS (Reuters) -...   
4895          2    Sports  U.S. Softball Team Posts Shutout No. 7 (AP) AP...   
4896          2    Sports  Paul Hamm's example PAUL HAMM'S fall and rise ...   

                                                 tokens  
0     [Wall, St., Bears, Claw, Back, Into, the, Blac...  
1     [Carlyle, Looks, Toward, Commercial, Aerospace...  
2     [Oil, and, Economy, Cloud, Stocks, ', Outlook,...  
3     [Iraq, Halts, Oil, Exports, from, Main, Southe...  
4     [Oil, price, soar, to, all-time, record, ,, po...  
...                                                 ...  
4865  [IOA, set, up, committee, to, probe, dope, sca...  
4866  [FACTBOX-Jonathan, Woodgate, factbox, MADRID, ...  
4867  [British, canoe, pair, lose, out, ATHENS, (, R...  
4895  [U.S., Softball, Team, Posts, Shutout, No, ., ...  
4896  [Paul, Hamm, 's, example, PAUL, HAMM, 'S, fall...  

[4000 rows x 4 columns]

Assignment for Task 3.2: Word Counts#

To see what the most used words per class are, create a new, separate dataframe with token counts.

  • Hint: use the pandas.Series.apply and str.isalpha() functions to filter out non-alphabetical tokens.

  • Hint: use the pandas.DataFrame.explode to create one row per class and token.

  • Hint: use pandas.DataFrame.groupby with .size() afterwards or pandas.DataFrame.pivot_table with size as the aggfunc to obtain the occurrences per class (a small sketch of this pattern on toy data is shown after these hints).

  • Hint: use the pandas.Series.reset_index function to obtain a dataframe with [class, tokens, count] as the columns.

  • Hint: use the pandas.DataFrame.sort_values function for sorting a dataframe.
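
As a reference for the hints above, here is a small sketch of the explode/groupby pattern on made-up toy data (illustration only, not the assignment solution).

# A small sketch of the explode/groupby pattern on toy data
# (illustration only, not the assignment solution).
toy = pd.DataFrame({
    "class": ["A", "A", "B"],
    "tokens": [["x", "y", "x"], ["y"], ["x"]],
})
toy_counts = (
    toy.explode("tokens")
       .groupby(["class", "tokens"])
       .size()
       .reset_index(name="count")
)
print(toy_counts)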

def get_word_counts(df, token_col="tokens"):
    """
    Generate dataframes with the word counts for each class in the data.

    Parameters
    ----------
    df : pandas.DataFrame
        The dataframe containing at least the "class" and "tokens" columns.
    token_col : str
        Name of the column that stores the tokens.

    Returns
    -------
    pandas.DataFrame
        There should be three columns in this dataframe.
        The "class" column shows the document class.
        The "tokens" column means tokens in the document class.
        The "count" column means the number of appearances of each token in the class.
        The dataframe should be sorted by the "class" and "count" columns.
    """
    ###################################
    # Fill in your answer here
    return None
    ###################################

Our goal is to have a dataframe that looks like the following. For simplicity, we only show the top 5 most frequent words. Your data frame should have more rows.

answer_word_counts = answer_get_word_counts(answer_df_with_tokens, token_col="tokens")
answer_word_counts.groupby("class").head(n=5)
class tokens count
3541 Business the 1460
0 Business a 1217
3587 Business to 923
1743 Business in 776
2424 Business on 687
8929 Sci/Tech the 1660
3961 Sci/Tech a 1121
8997 Sci/Tech to 1074
7319 Sci/Tech of 939
4423 Sci/Tech be 744
13816 Sports the 2280
9565 Sports a 978
11631 Sports in 815
13888 Sports to 793
12471 Sports of 714
18749 World the 1653
14384 World a 1241
18808 World to 1172
16474 World in 1164
17302 World of 908

The code below tests if the output of your function matches the expected output.

word_counts = get_word_counts(df_with_tokens, token_col="tokens")
check_answer_df(answer_word_counts, word_counts)
Test case 1 failed.

Your output is:
          class      tokens  count
3541   Business         the   1460
0      Business           a   1217
3587   Business          to    923
1743   Business          in    776
2424   Business          on    687
...         ...         ...    ...
19259     World      zalmay      1
19261     World     zeitoun      1
19262     World        zesn      1
19263     World         zim      1
19265     World  zimbabwean      1

[19268 rows x 3 columns]

Expected output is:
None

We provide the visualize_word_counts function in the utility file to visualize the word counts that you just computed using the wordcloud package.

visualize_word_counts(answer_word_counts)
(Output: four word cloud images, one per class, showing the most frequent tokens.)

Assignment for Task 3.3: Stop Words Removal#

The stop words make it difficult for us to identify representative words for each class. Let’s display the word counts using the data without stop words. But we need to remove the stop words first. Your task (which is your assignment) is to write a function that does the following:

  • Remove the stopwords from the tokens column in the dataframe.

    • Hint: use the pandas.DataFrame.isin function (a small demonstration on toy data is shown after these hints).

    • Hint: use the stopwords_list variable to help you check if a token is a stop word.
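
As a reference for the hints above, here is a small demonstration of isin and boolean masking on made-up toy data (illustration only, not the assignment solution).

# A small demonstration of isin and boolean masking on toy data
# (illustration only, not the assignment solution).
toy_tokens = pd.Series(["the", "olympics", "a", "gold"])
is_stopword = toy_tokens.isin(stopwords_list)
print(is_stopword)
# The "~" operator negates a boolean mask, keeping only the non-stopword rows.
print(toy_tokens[~is_stopword])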

def remove_stopwords(df):
    """
    Remove stopwords from the tokens.

    Parameters
    ----------
    df : pandas.DataFrame
        There should be three columns in this dataframe.
        The "class" column shows the document class.
        The "tokens" column means tokens in the document class.
        The "count" column means the number of appearances of each token in the class.
        The dataframe should be sorted by the "class" and "count" columns.

    Returns
    -------
    pandas.DataFrame
        The dataframe with the stopwords rows removed.
    """
    ###################################
    # Fill in your answer here
    return None
    ###################################

Our goal is to have a dataframe that looks like the following. For simplicity, we only show the top 5 most frequent words. Your data frame should have more rows.

answer_word_counts_no_stopword = answer_remove_stopwords(answer_word_counts)
answer_word_counts_no_stopword.groupby("class").head(n=5)
class tokens count
2948 Business reuters 483
2351 Business new 297
2698 Business price 281
3041 Business say 281
2419 Business oil 278
4209 Sci/Tech ap 236
7228 Sci/Tech new 224
8237 Sci/Tech say 186
4952 Sci/Tech company 137
7031 Sci/Tech microsoft 121
9817 Sports athens 354
9747 Sports ap 315
12497 Sports olympic 244
14281 Sports win 223
11335 Sports gold 208
18159 World say 305
18012 World reuters 276
14598 World ap 261
17194 World najaf 154
14474 World afp 149

The code below tests if the output of your function matches the expected output.

word_counts_no_stopword = remove_stopwords(word_counts)
check_answer_df(word_counts_no_stopword, answer_word_counts_no_stopword)
Test case 1 failed.

Your output is:
None

Expected output is:
          class      tokens  count
2948   Business     reuters    483
2351   Business         new    297
2698   Business       price    281
3041   Business         say    281
2419   Business         oil    278
...         ...         ...    ...
19259     World      zalmay      1
19261     World     zeitoun      1
19262     World        zesn      1
19263     World         zim      1
19265     World  zimbabwean      1

[18787 rows x 3 columns]
visualize_word_counts(answer_word_counts_no_stopword)
(Output: four word cloud images, one per class, with the stop words removed.)

Another Option: spaCy#

spaCy is another library used to perform various NLP tasks like tokenization, part-of-speech tagging, named entity recognition, dependency parsing, and much more. It provides pre-trained models for different languages and domains, which can be used as-is but also can be fine-tuned on a specific task or domain.

In an object-oriented way, spaCy can be thought of as a collection of classes and objects that work together to perform NLP tasks. Some of the important functions and classes in spaCy include:

  • nlp: The core function that provides the main functionality of spaCy. It is used to process text and create a Doc object.

  • Doc: A container for accessing linguistic annotations like tokens, part-of-speech tags, named entities, and dependency parse information. It is created by the nlp function and represents a processed document.

  • Token: An object representing a single token in a Doc object. It contains information like the token text, part-of-speech tag, lemma, embedding, and much more.

When a text is processed by spaCy, it is first passed to the nlp function, which uses the loaded model to tokenize the text and applies various linguistic annotations like part-of-speech tagging, named entity recognition, and dependency parsing in the background. The resulting annotations are stored in a Doc object, which can be accessed and manipulated using various methods and attributes.

# Load the small English model in spaCy.
# Disable Named Entity Recognition and the parser in the model pipeline since we're not using them.
# Check the following website for the spaCy NLP pipeline:
# - https://spacy.io/usage/processing-pipelines
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

# Process the text using spaCy.
doc = nlp(text)

# This becomes a spaCy Doc object, which prints nicely as the original string.
print(doc)
The quick brown fox jumped over the lazy dog. The cats couldn't wait to sleep all day.

The Doc object can be iterated over to access each Token object in the document. We can also directly access multiple attributes of the Token objects. For example, we can directly access the lemma of the token with Token.lemma_ and check if a token is a stop word with Token.is_stop. To make it easy to see them, we put them in a data frame.

spacy_doc_attributes = [(token, token.lemma_, token.is_stop) for token in doc]
pd.DataFrame(data=spacy_doc_attributes, columns=["token", "lemma", "is_stopword"])
token lemma is_stopword
0 The the True
1 quick quick False
2 brown brown False
3 fox fox False
4 jumped jump False
5 over over True
6 the the True
7 lazy lazy False
8 dog dog False
9 . . False
10 The the True
11 cats cat False
12 could could True
13 n't not True
14 wait wait False
15 to to True
16 sleep sleep False
17 all all True
18 day day False
19 . . False

The above example only deals with one sentence. Now we need to deal with all the sentences in all the classes. We provide an add_spacy_doc function in the utility file to add a column with a Doc representation of the text column to the dataframe. Now we can add the spaCy Doc objects using this function. This step will take some time since it needs to process all the sentences, so we added a progress bar using the tqdm package.

df_with_nltk_tokens_and_spacy_doc = add_spacy_doc(answer_df_with_tokens, nlp)
display(df_with_nltk_tokens_and_spacy_doc)
class_idx class text tokens doc
0 3 Business Wall St. Bears Claw Back Into the Black (Reute... [Wall, St., Bears, Claw, Back, Into, the, Blac... (Wall, St., Bears, Claw, Back, Into, the, Blac...
1 3 Business Carlyle Looks Toward Commercial Aerospace (Reu... [Carlyle, Looks, Toward, Commercial, Aerospace... (Carlyle, Looks, Toward, Commercial, Aerospace...
2 3 Business Oil and Economy Cloud Stocks' Outlook (Reuters... [Oil, and, Economy, Cloud, Stocks, ', Outlook,... (Oil, and, Economy, Cloud, Stocks, ', Outlook,...
3 3 Business Iraq Halts Oil Exports from Main Southern Pipe... [Iraq, Halts, Oil, Exports, from, Main, Southe... (Iraq, Halts, Oil, Exports, from, Main, Southe...
4 3 Business Oil prices soar to all-time record, posing new... [Oil, price, soar, to, all-time, record, ,, po... (Oil, prices, soar, to, all, -, time, record, ...
... ... ... ... ... ...
4865 2 Sports IOA sets up committee to probe dope scandal At... [IOA, set, up, committee, to, probe, dope, sca... (IOA, sets, up, committee, to, probe, dope, sc...
4866 2 Sports FACTBOX-Jonathan Woodgate factbox MADRID, Aug ... [FACTBOX-Jonathan, Woodgate, factbox, MADRID, ... (FACTBOX, -, Jonathan, Woodgate, factbox, MADR...
4867 2 Sports British canoe pair lose out ATHENS (Reuters) -... [British, canoe, pair, lose, out, ATHENS, (, R... (British, canoe, pair, lose, out, ATHENS, (, R...
4895 2 Sports U.S. Softball Team Posts Shutout No. 7 (AP) AP... [U.S., Softball, Team, Posts, Shutout, No, ., ... (U.S., Softball, Team, Posts, Shutout, No, ., ...
4896 2 Sports Paul Hamm's example PAUL HAMM'S fall and rise ... [Paul, Hamm, 's, example, PAUL, HAMM, 'S, fall... (Paul, Hamm, 's, example, PAUL, HAMM, 'S, fall...

4000 rows × 5 columns

We also provide an add_spacy_tokens function in the utility file to add the spaCy tokens to our original dataframe. A rough sketch of what such a function might look like is shown first, followed by the code that actually adds the spaCy tokens using the provided utility.
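
The sketch below keeps the lemmas of alphabetic, non-stopword tokens from each Doc. Note that this is only an illustration, and the actual add_spacy_tokens in util/util.py may filter or lemmatize differently.

# A rough sketch of the idea behind add_spacy_tokens (illustration only;
# the provided utility in util/util.py may differ in how it filters tokens).
def add_spacy_tokens_sketch(df):
    df = df.copy()
    df["spacy_tokens"] = df["doc"].progress_apply(
        lambda doc: [token.lemma_ for token in doc
                     if token.is_alpha and not token.is_stop]
    )
    return df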

df_with_nltk_tokens_and_spacy_tokens = add_spacy_tokens(df_with_nltk_tokens_and_spacy_doc)
display(df_with_nltk_tokens_and_spacy_tokens)
class_idx class text tokens doc spacy_tokens
0 3 Business Wall St. Bears Claw Back Into the Black (Reute... [Wall, St., Bears, Claw, Back, Into, the, Blac... (Wall, St., Bears, Claw, Back, Into, the, Blac... [Wall, Bears, Claw, Black, Reuters, Reuters, s...
1 3 Business Carlyle Looks Toward Commercial Aerospace (Reu... [Carlyle, Looks, Toward, Commercial, Aerospace... (Carlyle, Looks, Toward, Commercial, Aerospace... [Carlyle, look, Commercial, Aerospace, Reuters...
2 3 Business Oil and Economy Cloud Stocks' Outlook (Reuters... [Oil, and, Economy, Cloud, Stocks, ', Outlook,... (Oil, and, Economy, Cloud, Stocks, ', Outlook,... [oil, Economy, Cloud, Stocks, Outlook, Reuters...
3 3 Business Iraq Halts Oil Exports from Main Southern Pipe... [Iraq, Halts, Oil, Exports, from, Main, Southe... (Iraq, Halts, Oil, Exports, from, Main, Southe... [Iraq, Halts, Oil, Exports, Main, Southern, Pi...
4 3 Business Oil prices soar to all-time record, posing new... [Oil, price, soar, to, all-time, record, ,, po... (Oil, prices, soar, to, all, -, time, record, ... [oil, price, soar, time, record, pose, new, me...
... ... ... ... ... ... ...
4865 2 Sports IOA sets up committee to probe dope scandal At... [IOA, set, up, committee, to, probe, dope, sca... (IOA, sets, up, committee, to, probe, dope, sc... [IOA, set, committee, probe, dope, scandal, At...
4866 2 Sports FACTBOX-Jonathan Woodgate factbox MADRID, Aug ... [FACTBOX-Jonathan, Woodgate, factbox, MADRID, ... (FACTBOX, -, Jonathan, Woodgate, factbox, MADR... [FACTBOX, Jonathan, Woodgate, factbox, MADRID,...
4867 2 Sports British canoe pair lose out ATHENS (Reuters) -... [British, canoe, pair, lose, out, ATHENS, (, R... (British, canoe, pair, lose, out, ATHENS, (, R... [british, canoe, pair, lose, ATHENS, Reuters, ...
4895 2 Sports U.S. Softball Team Posts Shutout No. 7 (AP) AP... [U.S., Softball, Team, Posts, Shutout, No, ., ... (U.S., Softball, Team, Posts, Shutout, No, ., ... [Softball, Team, Posts, Shutout, AP, AP, Cat, ...
4896 2 Sports Paul Hamm's example PAUL HAMM'S fall and rise ... [Paul, Hamm, 's, example, PAUL, HAMM, 'S, fall... (Paul, Hamm, 's, example, PAUL, HAMM, 'S, fall... [Paul, Hamm, example, PAUL, HAMM, fall, rise, ...

4000 rows × 6 columns

Now we can use the function that we wrote before to get the word counts from the spaCy tokens.

spacy_word_counts = answer_get_word_counts(df_with_nltk_tokens_and_spacy_tokens, token_col="spacy_tokens")
spacy_word_counts.groupby("class").head(n=5)
class spacy_tokens count
2793 Business reuters 483
2563 Business price 317
2240 Business new 298
2880 Business say 287
2299 Business oil 281
3931 Sci/Tech ap 236
6774 Sci/Tech new 226
7739 Sci/Tech say 161
4625 Sci/Tech company 140
6584 Sci/Tech microsoft 124
9178 Sports athens 356
9112 Sports ap 315
11775 Sports olympic 262
13447 Sports win 233
10642 Sports gold 212
17110 World say 294
16967 World reuters 276
13728 World ap 261
16185 World najaf 157
13622 World afp 150

Task 4: Unsupervised Learning - Topic Modeling#

Topic modelling is a technique used in NLP that aims to identify the underlying topics or themes in a collection of texts. One way to perform topic modelling is using the probabilistic model Latent Dirichlet Allocation (LDA).

LDA assumes that each document in a collection is a mixture of different topics, and each topic is a probability distribution over a set of words. The model then infers the underlying topic distribution for each document in the collection and the word distribution for each topic. LDA is trained using an iterative algorithm that maximizes the likelihood of observing the given documents.

To use LDA, we need to represent the documents as a bag of words, where the order of the words is ignored and only the frequency of each word in the document is considered. This bag-of-words representation allows us to represent each document as a vector of word frequencies, which can be used as input to the LDA algorithm. Computing LDA might take a moment given our dataset size.

# Convert preprocessed text to bag-of-words representation using CountVectorizer.
vectorizer = CountVectorizer(max_features=1000)

We will use the fit_transform function of the vectorizer. It expects one string per document as input, so we join the tokens of each document back together into one string. We also reset the index for consistency.

df_strings = df_with_nltk_tokens_and_spacy_tokens["spacy_tokens"].apply(lambda x: " ".join(x))
df_strings = df_strings.reset_index(drop=True)
df_strings
0       Wall Bears Claw Black Reuters Reuters short se...
1       Carlyle look Commercial Aerospace Reuters Reut...
2       oil Economy Cloud Stocks Outlook Reuters Reute...
3       Iraq Halts Oil Exports Main Southern Pipeline ...
4       oil price soar time record pose new menace eco...
                              ...                        
3995    IOA set committee probe dope scandal Athens Au...
3996    FACTBOX Jonathan Woodgate factbox MADRID Aug R...
3997    british canoe pair lose ATHENS Reuters Slovaki...
3998    Softball Team Posts Shutout AP AP Cat Osterman...
3999    Paul Hamm example PAUL HAMM fall rise Olympics...
Name: spacy_tokens, Length: 4000, dtype: object

Then, we can use the fit_transform function to get the bag of words vector.

X = vectorizer.fit_transform(df_strings.values)

We convert the original matrix to a data frame to make it easier to see the bag of words. The columns indicate tokens, and the values for each cell indicate the word counts. The number of columns in the data frame matches the max_features parameter in the CountVectorizer. The number of rows matches the size of the training data.

pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
accept accord accounting accuse action activity ad add advance afghanistan ... wrap xinhuanet xp yahoo yankees year yesterday york young yukos
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3995 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3996 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3997 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3998 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3999 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

4000 rows × 1000 columns

Now we have the bag of words vector. We can use the vector for LDA topic modeling.

# Define the number of topics to model with LDA.
num_topics = 4

# Fit LDA to the feature matrix. Verbose so we know what iteration we are on.
# The random state is just for producing consistent results.
lda = LatentDirichletAllocation(n_components=num_topics, max_iter=10, random_state=42, verbose=True)
f = lda.fit(X)
iteration: 1 of max_iter: 10
iteration: 2 of max_iter: 10
iteration: 3 of max_iter: 10
iteration: 4 of max_iter: 10
iteration: 5 of max_iter: 10
iteration: 6 of max_iter: 10
iteration: 7 of max_iter: 10
iteration: 8 of max_iter: 10
iteration: 9 of max_iter: 10
iteration: 10 of max_iter: 10

Now we can check the topic vectors in the LDA model. Each vector represents a topic in a high-dimensional space formed by the word tokens. So, the vectors can also be viewed as weights that represent how important each word token is for the topic. In the following code block, we print the shape of the vectors. The number of rows should match the number of topics that we set before. The number of columns should match the max_features parameter, which is the number of words.

lda.components_.shape
(4, 1000)

Assignment for Task 4#

We want to get the weights for each word in each topic and visualize them using word clouds. In the above case, the shape should be (4, 1000), which means we have 4 topics, and each topic is represented by a distribution (i.e., weights) over 1000 words. To make the word cloud visualization simple, we only want to use the top n words with the highest weights.

Your task (which is your assignment) is to write a function that does the following:

  • Given a 1D NumPy array, return the indexes of the top n number of items according to their values. In other words, we want the indexes that can help us select the highest n values. For example, for n=3 in array [3,1,2,4,0], the function should return [3,0,2], because the highest value is 4 with index 3 in the original array, and so on.

    • Hint: use the numpy.argsort function.

  • Notice that the numpy.argsort function gives you the indexes ordered from the lowest value to the highest, which is not what we want. You need to figure out a way to reverse a NumPy array and select the top n items. A small demonstration of argsort and array reversal is shown below.
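
The following small demonstration shows the ascending behavior of numpy.argsort and how slicing can reverse a NumPy array; combining these two ideas for the assignment is left to you.

# numpy.argsort returns the indexes that would sort the array in ascending order.
a = np.array([3, 1, 2, 4, 0])
print(np.argsort(a))  # [4 1 2 0 3]: index 4 holds the lowest value (0)

# Slicing with [::-1] reverses a NumPy array.
print(a[::-1])  # [0 4 2 1 3]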

def get_index_of_top_n_items(array, n):
    """
    Given an NumPy array, return the indexes of the top "n" number of items according to their values.

    Parameters
    ----------
    array : numpy.ndarray
        A 1D NumPy array.
    n : int
        The top "n" number of items that we want.

    Returns
    -------
    numpy.ndarray
        The indexes of the top "n" items.
    """
    ###################################
    # Fill in your answer here
    return None
    ###################################

The following code shows the example that we mentioned above.

A = np.array([3,1,2,4,0])
answer_top_n_for_A = answer_get_index_of_top_n_items(A, n=3)
answer_top_n_for_A
array([3, 0, 2])

The code below tests if the output of your function matches the expected output.

B = lda.components_[0]
top_n_for_topic_0 = get_index_of_top_n_items(B, n=10)
answer_top_n_for_topic_0 = answer_get_index_of_top_n_items(B, n=10)
check_answer_np(top_n_for_topic_0, answer_top_n_for_topic_0)
Test case 1 failed.
Your output is:
None
Expected output is:
[ 38  56 602 978 361 884 526 747 995 603]

We can now use the indexing function that we just implemented inside the following helper function to get the weights of the top n words for each topic.

def get_word_weights_for_topics(lda_model, vectorizer, n=100):
    """
    Get weights for words for each topic.

    Parameters
    ----------
    lda_model : sklearn.decomposition.LatentDirichletAllocation
        The LDA model.
    vectorizer : sklearn.feature_extraction.text.CountVectorizer
        The count vectorizer.
    n : int
        Number of important words that we want to get.

    Returns
    -------
    dict of pandas.DataFrame
        A dictionary with data frames.
    """
    words = vectorizer.get_feature_names_out()
    n = len(words) if n is None else n
    topic_word_weights = {}

    for idx, topic_vector in enumerate(lda_model.components_):
        top_features_ind = answer_get_index_of_top_n_items(topic_vector, n=n)
        top_features = [words[i] for i in top_features_ind]
        weights = topic_vector[top_features_ind]
        df = pd.DataFrame(weights, index=top_features, columns=["weight"])
        df = df.sort_values(by="weight", ascending=False)
        topic_word_weights[idx] = df

    return topic_word_weights

Let's first take a look at the data. For simplicity, we only print the first 10 important words for each topic.

topic_word_weights = get_word_weights_for_topics(lda, vectorizer, n=100)
for k, v in topic_word_weights.items():
    print(f"\nTopic #{k}:")
    print(" ".join(v.index[0:10]))
    display(v.iloc[0:10])
Topic #0:
ap athens olympic win gold team man reuters year olympics
weight
ap 403.515105
athens 403.246789
olympic 309.246800
win 286.162784
gold 243.246784
team 210.578474
man 199.276155
reuters 190.315506
year 182.915320
olympics 175.246315
Topic #1:
reuters price oil google new say stock share york high
weight
reuters 613.090454
price 447.227585
oil 333.991865
google 330.413708
new 327.240235
say 302.630197
stock 251.689804
share 240.112990
york 220.173493
high 214.368258
Topic #2:
new ap company say microsoft security software service plan time
weight
new 294.138710
ap 211.640513
company 203.847837
say 197.105016
microsoft 128.118405
security 118.300459
software 117.660863
service 109.950400
plan 107.479263
time 107.182716
Topic #3:
say reuters ap iraq najaf president plan kill minister wednesday
weight
say 257.517882
reuters 240.920321
ap 225.568027
iraq 168.908879
najaf 157.249665
president 154.238458
plan 150.074533
kill 133.206681
minister 132.223089
wednesday 110.130430

Then, we can use the word weights to create word clouds.

# Generate a word cloud for each topic.
for topic_idx, words in topic_word_weights.items():
    frequencies = dict(zip(words.index, words["weight"]))
    wordcloud = WordCloud(background_color="white", width=1000, height=500).generate_from_frequencies(frequencies)
    # Display image
    plt.axis("off")
    plt.title(f"Topic {topic_idx}")
    plt.imshow(wordcloud)
    plt.show()
(Output: four word cloud images, one for each LDA topic.)

Compare this with the word cloud visualizations from the preprocessing step earlier. Does the LDA topic modeling represent the actual four document classes in the training data? What do you think?

For this task, we mainly evaluate topic modeling in a qualitative way by visually inspecting the word clouds. There are also quantitative ways to evaluate the models, but they are not covered in this course. If you are interested in this, check the following resources:

Task 5: Supervised Learning - Topic Classification#

Topic classification is a task in NLP that involves automatically assigning a given text document to one or more predefined categories or topics. This task is essential for various applications, such as document organization, search engines, sentiment analysis, and more.

In recent years, deep learning models have shown remarkable performance in various NLP tasks, including topic classification. We will explore a neural network-based approach for topic classification using the PyTorch framework. PyTorch provides an efficient way to build and train neural networks with a high degree of flexibility and ease of use.

Compute Word Embeddings#

We will first look at word embeddings, which represent words as vectors in a high-dimensional space. The key idea behind word embeddings is that words with similar meanings tend to appear in similar contexts, and therefore their vector representations should be close together in this high-dimensional space. Word embeddings have been widely used in various NLP tasks such as sentiment analysis, machine translation, and information retrieval.

There are several techniques to generate word embeddings, but one of the most popular methods is the Word2Vec algorithm, which is based on a neural network architecture. Word2Vec learns embeddings by predicting the probability of a word given its context (continuous bag of words or skip-gram model). The output of the network is a set of word vectors that can be used as embeddings.

We can train a Word2Vec model ourselves, but keep in mind that, later on, we will run into problems if words in the test set have no embedding. So let’s first apply the familiar preprocessing steps to the test set:

# Reformat the test set.
df_test_reformat = reformat_data(df_test)

# NLTK preprocessing.
df_test_with_tokens = answer_tokenize_and_lemmatize(df_test_reformat)

# spaCy preprocessing.
df_test_with_nltk_tokens_and_spacy_tokens = add_spacy_tokens(add_spacy_doc(df_test_with_tokens, nlp))

display(df_test_with_nltk_tokens_and_spacy_tokens)
class_idx class text tokens doc spacy_tokens
0 3 Business Fears for T N pension after talks Unions repre... [Fears, for, T, N, pension, after, talk, Union... (Fears, for, T, N, pension, after, talks, Unio... [fear, t, N, pension, talk, Unions, represent,...
1 4 Sci/Tech The Race is On: Second Private Team Sets Launc... [The, Race, be, On, :, Second, Private, Team, ... (The, Race, is, On, :, Second, Private, Team, ... [race, Second, private, Team, Sets, Launch, Da...
2 4 Sci/Tech Ky. Company Wins Grant to Study Peptides (AP) ... [Ky., Company, Wins, Grant, to, Study, Peptide... (Ky., Company, Wins, Grant, to, Study, Peptide... [Company, win, Grant, Study, Peptides, AP, AP,...
3 4 Sci/Tech Prediction Unit Helps Forecast Wildfires (AP) ... [Prediction, Unit, Helps, Forecast, Wildfires,... (Prediction, Unit, Helps, Forecast, Wildfires,... [prediction, Unit, help, Forecast, Wildfires, ...
4 4 Sci/Tech Calif. Aims to Limit Farm-Related Smog (AP) AP... [Calif, ., Aims, to, Limit, Farm-Related, Smog... (Calif., Aims, to, Limit, Farm, -, Related, Sm... [aim, Limit, Farm, relate, Smog, AP, AP, South...
... ... ... ... ... ... ...
473 3 Business New Overtime Rules Take Effect New Bush admini... [New, Overtime, Rules, Take, Effect, New, Bush... (New, Overtime, Rules, Take, Effect, New, Bush... [New, Overtime, Rules, Effect, New, Bush, admi...
479 3 Business Dollar Holds Gains, Fed Comments Help TOKYO (... [Dollar, Holds, Gains, ,, Fed, Comments, Help,... (Dollar, Holds, Gains, ,, Fed, Comments, Help,... [Dollar, Holds, Gains, Fed, comment, help, TOK...
481 3 Business Dark arts of spin evident in phoney war for Ab... [Dark, art, of, spin, evident, in, phoney, war... (Dark, arts, of, spin, evident, in, phoney, wa... [dark, art, spin, evident, phoney, war, Abbey,...
482 3 Business Controversial US Overtime Rules Take Effect Ne... [Controversial, US, Overtime, Rules, Take, Eff... (Controversial, US, Overtime, Rules, Take, Eff... [controversial, Overtime, Rules, Effect, new, ...
484 3 Business SAS Braathens to cut Gatwick, Geneva flights b... [SAS, Braathens, to, cut, Gatwick, ,, Geneva, ... (SAS, Braathens, to, cut, Gatwick, ,, Geneva, ... [SAS, Braathens, cut, Gatwick, Geneva, flight,...

400 rows × 6 columns

To train the model on the complete vocabulary, we combine the tokens columns of the training and test sets into one series and call the Word2Vec function.

# Rename the very long variables
df_train_preprocessd = df_with_nltk_tokens_and_spacy_tokens
df_test_preprocessd = df_test_with_nltk_tokens_and_spacy_tokens

# Get all tokens into one series.
tokens_both = pd.concat([df_train_preprocessd["tokens"], df_test_preprocessd["tokens"]])

# Train a Word2Vec model on the NLTK tokens.
w2v_model = Word2Vec(tokens_both.values, vector_size=40, min_count=1)

To obtain the embeddings, we can use the Word2Vec.wv[word] syntax. To get multiple vectors nicely next to each other in a 2D matrix, we can call numpy.vstack.

print(np.vstack([w2v_model.wv[word] for word in ["rain", "cat", "dog"]]))
[[-2.60392636e-01  3.61957699e-02  1.09209269e-01  1.59883693e-01
   7.47058615e-02 -1.83978200e-01  1.15174890e-01  1.41995341e-01
  -1.27571225e-01  4.41441536e-01 -7.66160041e-02 -4.21129227e-01
  -7.70293251e-02 -1.32060111e-01  4.39062655e-01  2.57349819e-01
  -6.41567782e-02  1.71880454e-01  6.53994977e-02 -2.28234261e-01
  -1.31064489e-01  1.52635470e-03  2.27316573e-01  1.49013713e-01
  -2.45736703e-01  1.41700311e-02 -9.18157026e-02  3.44747216e-01
  -3.31136227e-01 -1.63988218e-01  3.18268031e-01 -1.14187524e-01
  -9.01155993e-02 -3.15312743e-01 -2.72045583e-02  1.96554258e-01
   7.19608963e-01 -4.33225513e-01 -3.80716771e-01 -3.01050067e-01]
 [-5.55762206e-04  1.75153557e-02  1.63081214e-02  4.37505618e-02
   1.91775504e-02 -2.68786456e-02 -9.65814292e-03  4.77873608e-02
   1.68873817e-02  7.72785023e-02 -1.24694523e-03 -4.48038988e-02
  -3.69028486e-02  1.08696317e-04  5.88426664e-02  4.69341688e-02
  -9.38272290e-03  8.82259570e-03  2.31929235e-02 -4.38504890e-02
  -1.66977551e-02 -2.26898771e-02  3.22737843e-02  2.31878553e-02
  -1.98843461e-02  5.50923310e-03  1.77841890e-03  4.55366671e-02
  -2.32978743e-02 -3.88195775e-02  6.80615827e-02 -2.95684878e-02
  -2.69708876e-02 -5.04455604e-02 -1.28503460e-02  2.98611298e-02
   1.10825963e-01 -7.42685422e-02 -6.67480677e-02 -5.01767062e-02]
 [-7.88029432e-02  3.57245207e-02  1.87077932e-02  8.16702172e-02
  -3.05961762e-02 -5.07439226e-02  3.81175242e-02  3.65127474e-02
  -5.89498617e-02  1.90558270e-01 -2.58215386e-02 -1.45636722e-01
  -3.71700376e-02 -7.68104494e-02  1.64298996e-01  1.00462683e-01
   7.34513486e-03  2.07834784e-02  4.85605095e-03 -8.80082399e-02
  -4.16397192e-02 -2.22325884e-02  1.00829713e-01  1.42552294e-02
  -7.83072636e-02 -1.89409941e-03 -3.98297459e-02  1.38811961e-01
  -1.12817809e-01 -3.60937640e-02  1.17614634e-01 -6.38329387e-02
  -2.99992543e-02 -8.16034526e-02 -3.84166278e-02  1.12850294e-01
   3.21655065e-01 -1.53566912e-01 -1.89866170e-01 -6.67793155e-02]]
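
Since words with similar meanings should end up close together in the embedding space, we can also query the trained model for nearest neighbors. The query word "oil" below is just an example; the exact neighbors and similarity scores depend on the training data, the vector size, and random initialization, so your output will vary.

# Query the model for the words closest to "oil" in the embedding space.
# The exact neighbors and scores will vary between runs and datasets.
query = "oil"
if query in w2v_model.wv:
    print(w2v_model.wv.most_similar(query, topn=5))
else:
    print(f"'{query}' is not in the Word2Vec vocabulary.")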

The spaCy model we used has a Tok2Vec algorithm in its pipeline, so we can directly access the 2D matrix of all word vectors on a document with the Doc.tensor attribute. Keep in mind this still contains the embeddings of the stopwords.

print(doc.tensor)
[[ 0.44236845 -0.01180173  0.20908105 ... -0.18478878 -0.17872915
  -0.20212969]
 [ 1.347008   -0.9065002   1.0384194  ...  0.9311755  -0.88710904
  -1.0599288 ]
 [ 0.84580296 -1.3604947   0.8471415  ...  0.95768267 -0.07491717
   0.32537135]
 ...
 [ 1.4835097   0.13096532  0.63991284 ... -1.166823    0.250414
  -0.7703706 ]
 [-0.18388712 -0.7047122   0.49538338 ...  0.26794562 -0.6400709
   0.57972735]
 [-0.44166976 -0.5572643  -0.5958953  ... -0.62469536 -0.9879791
  -0.19887315]]

To prepare the word embeddings for classification, we will add a tensor column to both the training and testing dataframes. Each cell in the tensor column should be a tensor containing the word embedding vectors for the text in the corresponding row. The tensors need to have the same size for both the training and test sets, so we also need to pad the smaller tensors by adding zeros at the end. We provide an add_padded_tensors function in the utility file for doing this; the small sketch below only illustrates the zero-padding idea.
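
The sketch pads a hypothetical 3-token document up to a target length of 6 tokens by appending rows of zeros; it is an illustration only, not the provided add_padded_tensors implementation.

# A small illustration of zero-padding a 2D tensor along the token dimension
# (illustration only; the provided add_padded_tensors may differ).
import torch.nn.functional as F

small = torch.ones((3, 5))  # e.g., 3 tokens with 5-dimensional embeddings
max_len = 6                 # hypothetical target length
padded = F.pad(small, (0, 0, 0, max_len - small.shape[0]))
print(padded.shape)         # torch.Size([6, 5])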

df_train_with_tensor, df_test_with_tensor = add_padded_tensors(df_train_preprocessd, df_test_preprocessd)
display(df_test_with_tensor)
class_idx class text tokens doc spacy_tokens tensor
0 3 Business Fears for T N pension after talks Unions repre... [Fears, for, T, N, pension, after, talk, Union... (Fears, for, T, N, pension, after, talks, Unio... [fear, t, N, pension, talk, Unions, represent,... [[-0.9206627, 0.7367932, -0.17732513, 1.061838...
1 4 Sci/Tech The Race is On: Second Private Team Sets Launc... [The, Race, be, On, :, Second, Private, Team, ... (The, Race, is, On, :, Second, Private, Team, ... [race, Second, private, Team, Sets, Launch, Da... [[0.3289305, -0.659908, -0.04628387, 0.459979,...
2 4 Sci/Tech Ky. Company Wins Grant to Study Peptides (AP) ... [Ky., Company, Wins, Grant, to, Study, Peptide... (Ky., Company, Wins, Grant, to, Study, Peptide... [Company, win, Grant, Study, Peptides, AP, AP,... [[-0.085725844, -1.1698873, 0.5357769, 1.32863...
3 4 Sci/Tech Prediction Unit Helps Forecast Wildfires (AP) ... [Prediction, Unit, Helps, Forecast, Wildfires,... (Prediction, Unit, Helps, Forecast, Wildfires,... [prediction, Unit, help, Forecast, Wildfires, ... [[-0.092530996, -1.1091797, 0.007843953, 0.410...
4 4 Sci/Tech Calif. Aims to Limit Farm-Related Smog (AP) AP... [Calif, ., Aims, to, Limit, Farm-Related, Smog... (Calif., Aims, to, Limit, Farm, -, Related, Sm... [aim, Limit, Farm, relate, Smog, AP, AP, South... [[-0.10481978, -0.5771505, 1.3032898, 0.098989...
... ... ... ... ... ... ... ...
473 3 Business New Overtime Rules Take Effect New Bush admini... [New, Overtime, Rules, Take, Effect, New, Bush... (New, Overtime, Rules, Take, Effect, New, Bush... [New, Overtime, Rules, Effect, New, Bush, admi... [[0.63283896, -0.8598237, 0.18151897, 0.718614...
479 3 Business Dollar Holds Gains, Fed Comments Help TOKYO (... [Dollar, Holds, Gains, ,, Fed, Comments, Help,... (Dollar, Holds, Gains, ,, Fed, Comments, Help,... [Dollar, Holds, Gains, Fed, comment, help, TOK... [[0.27932516, -1.1461885, 0.29283038, 0.256060...
481 3 Business Dark arts of spin evident in phoney war for Ab... [Dark, art, of, spin, evident, in, phoney, war... (Dark, arts, of, spin, evident, in, phoney, wa... [dark, art, spin, evident, phoney, war, Abbey,... [[0.6209623, -0.8559445, -0.29310828, 0.090776...
482 3 Business Controversial US Overtime Rules Take Effect Ne... [Controversial, US, Overtime, Rules, Take, Eff... (Controversial, US, Overtime, Rules, Take, Eff... [controversial, Overtime, Rules, Effect, new, ... [[0.19809127, -0.7492492, 0.23203318, 0.294419...
484 3 Business SAS Braathens to cut Gatwick, Geneva flights b... [SAS, Braathens, to, cut, Gatwick, ,, Geneva, ... (SAS, Braathens, to, cut, Gatwick, ,, Geneva, ... [SAS, Braathens, cut, Gatwick, Geneva, flight,... [[-0.34229928, -0.94983363, 0.3551279, 0.64431...

400 rows × 7 columns

Build the Classifier#

Our neural network will take the embedding representation of the document as input and predict the corresponding topic using a softmax output layer. We will evaluate the performance of our model using various metrics such as accuracy, precision, recall, and F1-score.

The following code demonstrates how to implement a neural network for topic classification in PyTorch. First, let’s do some more preparation on our inputs, turning them into PyTorch tensors.

# Transform spaCy tensors into PyTorch tensors.
input_train = torch.from_numpy(np.stack(df_train_with_tensor["tensor"]))
input_test = torch.from_numpy(np.stack(df_test_with_tensor["tensor"]))

# Get the labels, move to 0-indexed instead of 1-indexed.
train_labels = torch.from_numpy(df_train_with_tensor["class_idx"].values) - 1
test_labels = torch.from_numpy(df_test_with_tensor["class_idx"].values) - 1

# One-hot encode labels for training.
train_target = torch.zeros((len(train_labels), 4))
train_target = train_target.scatter_(1, train_labels.unsqueeze(1), 1).unsqueeze(1)

Then, it is time to define our network. The neural net consists of three fully connected layers (fc1, fc2, and fc3) with ReLU activation (relu) in between each layer. We flatten the input tensor using view before passing it through the fully connected layers. Finally, we apply the softmax activation function (softmax) to the output tensor to obtain the predicted probabilities for each class.

class TopicClassifier(nn.Module):
    def __init__(self, input_width, input_length, output_size):
        super(TopicClassifier, self).__init__()
        self.input_width = input_width
        self.input_length = input_length
        self.output_size = output_size

        self.fc1 = nn.Linear(input_width * input_length, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, output_size)

        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        # Flatten the input tensor.
        x = x.view(-1, self.input_width * self.input_length)

        # Pass through the fully connected layers with ReLU activation.
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.relu(x)
        x = self.fc3(x)

        # Apply softmax activation to the output.
        x = self.softmax(x)
        return x

Now it’s time to train our network. This may take a while, but the current loss will be printed after every epoch. If you want to run the code faster, you can also put this notebook on Google Colab and use its provided GPU to speed up computing.

# Define parameters.
n_classes = len(train_labels.unique())
input_size = input_train.shape[1:]
num_epochs = 5
lr = 0.001

# Define model, loss function and optimizer.
model = TopicClassifier(*input_size, output_size=n_classes)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=lr)

# Training loop.
for epoch in range(num_epochs):
    for i, (inputs, labels) in enumerate(zip(input_train, train_target)):
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
    print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}")
Epoch [1/5], Loss: 1.3852
Epoch [2/5], Loss: 1.3765
Epoch [3/5], Loss: 1.3249
Epoch [4/5], Loss: 1.0777
Epoch [5/5], Loss: 0.8335

Optional Assignment for Task 5#

The following code evaluates the model using a confusion matrix.

# Evaluate the neural net on the test set.
model.eval()

# Sample from the model.
with torch.no_grad():
    test_outputs = model(input_test)
    # Get the label with the highest predicted probability.
    test_pred = np.argmax(test_outputs.detach(), axis=1)

# Set model back to training mode.
model.train()

# Compute the confusion matrix
cm = confusion_matrix(test_labels, test_pred)

# Plot the confusion matrix using seaborn
labels = ["World", "Sports", "Business", "Sci/Tech"]
h = sns.heatmap(cm, annot=True, cmap="Blues", fmt="g", xticklabels=labels, yticklabels=labels)
ax = plt.xlabel("Predicted Labels")
ax = plt.ylabel("True Labels")
(Output: a confusion matrix heatmap of true vs. predicted labels on the test set.)

If you do not feel done with text data yet, there is always more to do. In this optional assignment, you can experiment with the number of epochs, learning rate, vector size, optimizer, neural network architecture, regularization, etc. Also, we only use a small subset of this dataset for performance reasons. If you have a high-end computer, you can go to the beginning of this tutorial and increase the size of the subset.

Even during the preprocessing, we could have done some things differently, like making everything lowercase and removing punctuation. Be aware that every choice you make along the way trickles down into your pipeline and can have some effect on your results. Also, take the time to write the code to evaluate the model with more metrics, such as accuracy, precision, recall, and the F1 score.
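
As one possible starting point for the metrics part of this optional assignment, the minimal sketch below uses scikit-learn's classification_report; it assumes the test_labels and test_pred variables from the evaluation cell above are still available.

# A minimal sketch for computing more evaluation metrics (one possible
# approach; assumes test_labels and test_pred from the evaluation cell above).
from sklearn.metrics import classification_report

class_names = ["World", "Sports", "Business", "Sci/Tech"]
print(classification_report(test_labels, test_pred, target_names=class_names))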


1

Credit: this teaching material was created by Robert van Straten and revised by Alejandro Monroy under the supervision of Yen-Chia Hsu.