Tutorial (Text Data Processing)#

(Last updated: Mar 7, 2023)

This tutorial will familiarize you with the data science pipeline of processing text data. We will go through the various steps involved in the Natural Language Processing (NLP) pipeline for topic modelling and topic classification, including tokenization, lemmatization, and obtaining word embeddings. We will also build a neural network using PyTorch for multi-class topic classification using the dataset.

The AG’s News Topic Classification Dataset contains news articles from four different categories, making it a nice source of text data for NLP tasks. We will guide you through the process of understanding the dataset, implementing various NLP techniques, and building a model for classification.

Scenario#

The AG’s News Topic Classification Dataset is a collection of over 1 million news articles from more than 2000 news sources. The dataset was created by selecting the 4 largest classes from the original corpus, resulting in 120,000 training samples and 7,600 testing samples. The dataset is provided by the academic community for research purposes in data mining, information retrieval, and other non-commercial activities. We will use it to demonstrate various NLP techniques on real data and, in the end, build two models with it. The files train.csv and test.csv contain all the training and testing samples as comma-separated values with 3 columns: class index, title, and description. Download train.csv and test.csv for the following tasks.

Import Packages#

All the packages needed for this tutorial are imported below:

import os
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import spacy
import torch
import torch.nn as nn
import torch.optim as optim

from gensim.models import Word2Vec

from nltk.corpus import stopwords, wordnet
from nltk.stem import SnowballStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score, confusion_matrix

from tqdm.notebook import tqdm

from xml.sax import saxutils as su

# Add tqdm functions to pandas.
tqdm.pandas()

Task Answers#

The code block below contains the answers for the assignments in this tutorial. Do not check the answers before attempting the tasks yourself.

def check_answer_df(df_result, df_answer, n=1):
    """
    This function checks if two output dataframes are the same.

    Parameters
    ----------
    df_result : pandas.DataFrame
        The result from the output of a function.
    df_answer: pandas.DataFrame
        The expected output of the function.
    n : int
        The numbering of the test case.
    """
    try:
        assert df_answer.equals(df_result)
        print(f"Test case {n} passed.")
    except Exception:
        print(f"Test case {n} failed.")
        print("Your output is:")
        display(df_result)
        print("Expected output is:")
        display(df_answer)


def check_answer_np(arr_result, arr_answer, n=1):
    """
    This function checks if two output arrays are the same.

    Parameters
    ----------
    arr_result : numpy.ndarray
        The result from the output of a function.
    arr_answer : numpy.ndarray
        The expected output of the function.
    n : int
        The numbering of the test case.
    """
    try:
        assert np.array_equal(arr_result, arr_answer)
        print(f"Test case {n} passed.")
    except Exception:
        print(f"Test case {n} failed.")
        print("Your output is:")
        print(arr_result)
        print("Expected output is:")
        print(arr_answer)


def answer_tokenize_and_lemmatize(df):
    """
    Tokenize and lemmatize the text in the dataset.

    Parameters
    ----------
    df : pandas.DataFrame
        The dataframe containing at least the text column.

    Returns
    -------
    pandas.DataFrame
        The dataframe with the added tokens column.
    """
    # Copy the dataframe to avoid editing the original one.
    df = df.copy(deep=True)

    # Apply the tokenizer to create the tokens column.
    df['tokens'] = df['text'].progress_apply(word_tokenize)

    # Apply the lemmatizer on every word in the tokens list.
    df['tokens'] = df['tokens'].progress_apply(lambda tokens: [lemmatizer.lemmatize(token, wordnet_pos(tag))
                                                               for token, tag in nltk.pos_tag(tokens)])
    return df


def answer_most_used_words(df, token_col='tokens'):
    """
    Generate a dataframe with the 5 most used words per class, and their count.

    Parameters
    ----------
    df : pandas.DataFrame
        The dataframe containing at least the class and tokens columns.

    Returns
    -------
    pandas.DataFrame
        The dataframe with 5 rows per class, and an added 'count' column.
        The dataframe is sorted in ascending order on the class and in descending order on the count.
    """
    # Copy the dataframe to avoid editing the original one.
    df = df.copy(deep=True)

    # Filter out non-words
    df[token_col] = df[token_col].apply(lambda tokens: [token for token in tokens if token.isalpha()])

    # Explode the tokens so that every token gets its own row.
    df = df.explode(token_col)

    # Option 1: groupby on class and token, get the size (number of rows) per group,
    # and add that as a 'count' column.
    counts = df.groupby(['class', token_col]).size().reset_index(name='count')

    # Option 2: make a pivot table on class and token, counting how many rows
    # there are per combination, and add that as a 'count' column.
    # counts = df.pivot_table(index=['class', token_col], aggfunc='size').reset_index(name='count')

    # Sort the values on the class and count, get only the first 5 rows per class.
    counts = counts.sort_values(['class', 'count'], ascending=[True, False]).groupby('class').head()

    return counts


def answer_remove_stopwords(df):
    """
    Remove stopwords from the tokens.

    Parameters
    ----------
    df : pandas.DataFrame
        The dataframe containing at least the tokens column,
        where the value in each row is a list of tokens.

    Returns
    -------
    pandas.DataFrame
        The dataframe with stopwords removed from the tokens column.
    """
    # Copy the dataframe to avoid editing the original one.
    df = df.copy(deep=True)

    # Using a set for quicker lookups.
    stopwords_set = set(stopwords_list)

    # Filter stopwords from tokens.
    df['tokens'] = df['tokens'].apply(lambda tokens: [token for token in tokens
                                                      if token.lower() not in stopwords_set])

    return df


def answer_spacy_tokens(df):
    """
    Add a column with a list of lemmatized tokens, without stopwords.

    Parameters
    ----------
    df : pandas.DataFrame
        The dataframe containing at least the doc column.

    Returns
    -------
    pandas.DataFrame
        The dataframe with the spacy_tokens column.
    """
    # Copy the dataframe to avoid editing the original one.
    df = df.copy(deep=True)

    df['spacy_tokens'] = df['doc'].apply(lambda tokens: [token.lemma_ for token in tokens
                                                         if not token.is_stop])

    return df


def answer_largest_proportion(arr):
    """
    For every row, get the column number where it has the largest value.

    Parameters
    ----------
    arr : numpy.array
        The array with the number of topics as the number of columns
        and the number of documents as the number of rows.
        Every row should sum up to 1.

    Returns
    -------
    numpy.array
        The 1-dimensional array containing the label of the topic
        in which the document has the largest proportion.
    """
    return np.argmax(arr, axis=1)


def answer_add_padded_tensors(df1, df2):
    """
    Add a tensor column to the dataframes, with every tensor having the same dimensions.

    Parameters
    ----------
    df1 : pandas.DataFrame
        The first dataframe containing at least the tokens or doc column.
    df2 : pandas.DataFrame
        The second dataframe containing at least the tokens or doc column.

    Returns
    -------
    tuple[pandas.DataFrame]
        The dataframes with the added tensor column.
    """
    # Copy the dataframes to avoid editing the originals.
    df1 = df1.copy(deep=True)
    df2 = df2.copy(deep=True)

    # Sample 10% from both datasets.
    df1 = df1.sample(frac=0.1, random_state=42)
    df2 = df2.sample(frac=0.1, random_state=42)

    # Add tensors. Two options are shown below; use only one of them,
    # since the second would overwrite the first.

    # Option 1: embeddings from our own Word2Vec model (commented out here).
    # for df in [df1, df2]:
    #     df['tensor'] = df['tokens'].apply(lambda tokens: np.vstack([w2v_model.wv[token]
    #                                                                 for token in tokens]))

    # Option 2: spaCy tensors.
    for df in [df1, df2]:
        df['tensor'] = df['doc'].apply(lambda doc: doc.tensor)

    # Determine the largest number of rows (tokens) across both datasets.
    largest = max(df1['tensor'].apply(lambda x: x.shape[0]).max(),
                  df2['tensor'].apply(lambda x: x.shape[0]).max())

    # Pad all tensors with rows of zeros at the end so they share the same shape.
    for df in [df1, df2]:
        df['tensor'] = df['tensor'].apply(lambda x: np.pad(x, ((0, largest - x.shape[0]), (0, 0))))

    return df1, df2


# Confusion matrix code

# # Compute the confusion matrix
# cm = confusion_matrix(test_labels, test_pred)

# # Plot the confusion matrix using seaborn
# sns.heatmap(cm, annot=True, cmap='Blues', fmt='g', xticklabels=labels, yticklabels=labels)

Task 3: Preprocess Text Data#

In this task, we will preprocess the text data from the AG News Dataset. First, we need to load the files.

df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

display(df_train, df_test)
Class Index Title Description
0 3 Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindli...
1 3 Carlyle Looks Toward Commercial Aerospace (Reu... Reuters - Private investment firm Carlyle Grou...
2 3 Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\ab...
3 3 Iraq Halts Oil Exports from Main Southern Pipe... Reuters - Authorities have halted oil export\f...
4 3 Oil prices soar to all-time record, posing new... AFP - Tearaway world oil prices, toppling reco...
... ... ... ...
119995 1 Pakistan's Musharraf Says Won't Quit as Army C... KARACHI (Reuters) - Pakistani President Perve...
119996 2 Renteria signing a top-shelf deal Red Sox general manager Theo Epstein acknowled...
119997 2 Saban not going to Dolphins yet The Miami Dolphins will put their courtship of...
119998 2 Today's NFL games PITTSBURGH at NY GIANTS Time: 1:30 p.m. Line: ...
119999 2 Nets get Carter from Raptors INDIANAPOLIS -- All-Star Vince Carter was trad...

120000 rows × 3 columns

Class Index Title Description
0 3 Fears for T N pension after talks Unions representing workers at Turner Newall...
1 4 The Race is On: Second Private Team Sets Launc... SPACE.com - TORONTO, Canada -- A second\team o...
2 4 Ky. Company Wins Grant to Study Peptides (AP) AP - A company founded by a chemistry research...
3 4 Prediction Unit Helps Forecast Wildfires (AP) AP - It's barely dawn when Mike Fitzpatrick st...
4 4 Calif. Aims to Limit Farm-Related Smog (AP) AP - Southern California's smog-fighting agenc...
... ... ... ...
7595 1 Around the world Ukrainian presidential candidate Viktor Yushch...
7596 2 Void is filled with Clement With the supply of attractive pitching options...
7597 2 Martinez leaves bitter Like Roger Clemens did almost exactly eight ye...
7598 3 5 of arthritis patients in Singapore take Bext... SINGAPORE : Doctors in the United States have ...
7599 3 EBay gets into rentals EBay plans to buy the apartment and home renta...

7600 rows × 3 columns

As shown below, the classes are distributed evenly in both the train and the test data.

display(df_train['Class Index'].value_counts(), df_test['Class Index'].value_counts())
3    30000
4    30000
2    30000
1    30000
Name: Class Index, dtype: int64
3    1900
4    1900
2    1900
1    1900
Name: Class Index, dtype: int64

To make the data easier to interpret, we will add a class column derived from the original Class Index column, containing the category of the news article. To process the title and the news text together, we will combine the Title and Description columns into one text column. We will work with just the train data until the point where we need the test data again.

def reformat_data(df):
    """
    Reformat the 'Class Index' column into a 'class' column and combine
    the 'Title' and 'Description' columns into a 'text' column.
    Select only the class_idx, class, and text columns afterwards.

    Parameters
    ----------
    df : pandas.DataFrame
        The original dataframe.
   
    Returns
    -------
    pandas.DataFrame
        The reformatted dataframe.
    """
    # Make the class column using a dictionary.
    df = df.rename(columns={'Class Index': 'class_idx'})
    classes = {1: 'World', 2: 'Sports', 3: 'Business', 4: 'Sci/Tech'}
    df['class'] = df['class_idx'].apply(classes.get)

    # Use string concatenation for the text column and unescape HTML characters.
    df['text'] = (df['Title'] + ' ' + df['Description']).apply(su.unescape)

    # Select only the class_idx, class, and text columns.
    df = df[['class_idx', 'class', 'text']]
    return df


df_train = reformat_data(df_train)
display(df_train)
class_idx class text
0 3 Business Wall St. Bears Claw Back Into the Black (Reute...
1 3 Business Carlyle Looks Toward Commercial Aerospace (Reu...
2 3 Business Oil and Economy Cloud Stocks' Outlook (Reuters...
3 3 Business Iraq Halts Oil Exports from Main Southern Pipe...
4 3 Business Oil prices soar to all-time record, posing new...
... ... ... ...
119995 1 World Pakistan's Musharraf Says Won't Quit as Army C...
119996 2 Sports Renteria signing a top-shelf deal Red Sox gene...
119997 2 Sports Saban not going to Dolphins yet The Miami Dolp...
119998 2 Sports Today's NFL games PITTSBURGH at NY GIANTS Time...
119999 2 Sports Nets get Carter from Raptors INDIANAPOLIS -- A...

120000 rows × 3 columns

Tokenization#

Tokenization is the process of breaking down a text into individual tokens, which are usually words but can also be phrases or sentences. It helps language models to understand and analyze text data by breaking it down into smaller, more manageable pieces. While it may seem like a trivial task, tokenization can be applied in multiple ways and thus be a complex and challenging task influencing NLP applications.

For example, in languages like English, it is generally straightforward to identify words by using spaces as delimiters. However, there are exceptions, such as contractions like “can’t” and hyphenated words like “self-driving”. In Dutch, where multiple nouns can be combined into one larger compound noun without any delimiter, this can be harder: how would you tokenize “hippopotomonstrosesquippedaliofobie”? In other languages, such as Chinese and Japanese, there are no spaces between words at all, so identifying word boundaries is much more difficult.

To illustrate the use of tokenization, let’s consider the following example, which tokenizes a sample text using the word_tokenize function from the NLTK package. That function uses a pre-trained tokenization model for English.

# Sample text.
text = "The quick brown fox jumped over the lazy dog. The cats couldn't wait to sleep all day."

# Tokenize the text.
tokens = word_tokenize(text)

# Print the text and the tokens.
print("Original text:", text)
print("Tokenized text:", tokens)
Original text: The quick brown fox jumped over the lazy dog. The cats couldn't wait to sleep all day.
Tokenized text: ['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', '.', 'The', 'cats', 'could', "n't", 'wait', 'to', 'sleep', 'all', 'day', '.']
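
For comparison, NLTK can also tokenize at the sentence level with sent_tokenize; this is a small extra illustration of the point above that tokens can also be sentences, and it is not used later in this tutorial:

from nltk.tokenize import sent_tokenize

# Split the same sample text into sentences instead of words.
print(sent_tokenize(text))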

Part-of-speech tagging#

Part-of-speech (POS) tagging is the process of assigning each word in a text corpus with a specific part-of-speech tag based on its context and definition. The tags typically include nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, interjections, and more. POS tagging can help other NLP tasks disambiguate a token somewhat due to the added context.

pos_tags = nltk.pos_tag(tokens)
print(pos_tags)
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumped', 'VBD'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.'), ('The', 'DT'), ('cats', 'NNS'), ('could', 'MD'), ("n't", 'RB'), ('wait', 'VB'), ('to', 'TO'), ('sleep', 'VB'), ('all', 'DT'), ('day', 'NN'), ('.', '.')]

Stemming / Lemmatization#

Stemming and lemmatization are two common techniques used in NLP to preprocess and normalize text data. Both techniques involve transforming words into their root form, but they differ in their approach and the level of normalization they provide.

Stemming is a technique that involves reducing words to their base or stem form by removing any affixes or suffixes. For example, the stem of the word “lazily” would be “lazi”. Stemming is a simple and fast technique that can be useful. However, it can also produce inaccurate or incorrect results since it does not consider the context or part of speech of the word.

Lemmatization, on the other hand, is a more sophisticated technique that involves identifying the base or dictionary form of a word, also known as the lemma. Unlike stemming, lemmatization can consider the context and part of speech of the word, which can make it more accurate and reliable. With lemmatization, the lemma of the word “lazily” would be “lazy”. Lemmatization can be slower and more complex than stemming but provides a higher level of normalization.

# Initialize the stemmer and lemmatizer.
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()


def wordnet_pos(nltk_pos):
    """
    Function to map POS tags to wordnet tags for lemmatizer.
    """
    if nltk_pos.startswith('V'):
        return wordnet.VERB
    elif nltk_pos.startswith('J'):
        return wordnet.ADJ
    elif nltk_pos.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN


# Perform stemming and lemmatization separately on the tokens.
stemmed_tokens = [stemmer.stem(token) for token in tokens]
lemmatized_tokens = [lemmatizer.lemmatize(token, wordnet_pos(tag))
                     for token, tag in nltk.pos_tag(tokens)]

# Print the results.
print("Stemmed text:", stemmed_tokens)
print("Lemmatized text:", lemmatized_tokens)
Stemmed text: ['the', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazi', 'dog', '.', 'the', 'cat', 'could', "n't", 'wait', 'to', 'sleep', 'all', 'day', '.']
Lemmatized text: ['The', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog', '.', 'The', 'cat', 'could', "n't", 'wait', 'to', 'sleep', 'all', 'day', '.']

Stopword removal#

Stopword removal is a common technique used in NLP to preprocess and clean text data by removing words that are considered to be of little or no value in terms of conveying meaning or information. These words are called “stopwords” and they include common words such as “the”, “a”, “an”, “and”, “or”, “but”, and so on.

The purpose of stopword removal in NLP is to improve the accuracy and efficiency of text analysis and processing by reducing the noise and complexity of the data. Stopwords are often used to form grammatical structures in a sentence, but they do not carry much meaning or relevance to the main topic or theme of the text. So by removing these words, we can reduce the dimensionality of the text data, improve the performance of machine learning models, and speed up the processing of text data. NLTK has a predefined list of stopwords for English.

# English stopwords in NLTK.
stopwords_list = stopwords.words('english')
print(stopwords_list)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
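
As a quick illustration on the sample tokens from before (essentially what the assignment below asks you to do on the whole dataframe), stopwords can be filtered out with a list comprehension; a set is used for faster membership checks:

# Use a set for faster lookups and compare in lowercase.
stopwords_set = set(stopwords_list)

filtered_tokens = [token for token in tokens if token.lower() not in stopwords_set]
print(filtered_tokens)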

Assignment for Task 3#

Your assignment for this task is to write functions that do the following:

  • Since we want to use our text to make a model later on, we need to preprocess it. Add a tokens column to the df_train dataframe with the text tokenized, then lemmatize those tokens. You must use the POS tags when lemmatizing.

    • Hint: Use the pandas.Series.apply function with the imported nltk.tokenize.word_tokenize function. Recall that you can use the pd.Series.apply? syntax in a code cell for more information.

    • Hint: use the nltk.stem.WordNetLemmatizer.lemmatize function to lemmatize a token. Use the wordnet_pos function to obtain the POS tag for the lemmatizer.

    Tokenizing and lemmatizing the entire dataset can take a while. In the answer version, we use tqdm and pandas.Series.progress_apply to show progress bars for these operations.

    Our goal is to have a dataframe that looks like the following:

# This part of code will take several minutes to run.
answer_df = answer_tokenize_and_lemmatize(df_train)
display(answer_df)
class_idx class text tokens
0 3 Business Wall St. Bears Claw Back Into the Black (Reute... [Wall, St., Bears, Claw, Back, Into, the, Blac...
1 3 Business Carlyle Looks Toward Commercial Aerospace (Reu... [Carlyle, Looks, Toward, Commercial, Aerospace...
2 3 Business Oil and Economy Cloud Stocks' Outlook (Reuters... [Oil, and, Economy, Cloud, Stocks, ', Outlook,...
3 3 Business Iraq Halts Oil Exports from Main Southern Pipe... [Iraq, Halts, Oil, Exports, from, Main, Southe...
4 3 Business Oil prices soar to all-time record, posing new... [Oil, price, soar, to, all-time, record, ,, po...
... ... ... ... ...
119995 1 World Pakistan's Musharraf Says Won't Quit as Army C... [Pakistan, 's, Musharraf, Says, Wo, n't, Quit,...
119996 2 Sports Renteria signing a top-shelf deal Red Sox gene... [Renteria, sign, a, top-shelf, deal, Red, Sox,...
119997 2 Sports Saban not going to Dolphins yet The Miami Dolp... [Saban, not, go, to, Dolphins, yet, The, Miami...
119998 2 Sports Today's NFL games PITTSBURGH at NY GIANTS Time... [Today, 's, NFL, game, PITTSBURGH, at, NY, GIA...
119999 2 Sports Nets get Carter from Raptors INDIANAPOLIS -- A... [Nets, get, Carter, from, Raptors, INDIANAPOLI...

120000 rows × 4 columns

  • To see what the most used words per class are, create a new, separate dataframe with the 5 most used words per class. Sort the resulting dataframe ascending on the class and descending on the count.

    • Hint: use the pandas.Series.apply and str.isalpha() functions to filter out non-alphabetical tokens.

    • Hint: use the pandas.DataFrame.explode to create one row per class and token.

    • Hint: use pandas.DataFrame.groupby with .size() afterwards or pandas.DataFrame.pivot_table with size as the aggfunc to obtain the occurrences per class.

    • Hint: use the pandas.Series.reset_index function to obtain a dataframe with [class, tokens, count] as the columns.

    • Hint: use the pandas.DataFrame.sort_values function for sorting a dataframe.

    • Hint: use the pandas.DataFrame.groupby and pandas.DataFrame.head functions to get the first 5 rows per class.

    Our goal is to have a dataframe that looks like the following:

display(answer_most_used_words(answer_df))
class tokens count
28111 Business the 37998
17122 Business a 30841
28259 Business to 29384
24422 Business of 22539
22506 Business in 21446
63963 Sci/Tech the 40767
64137 Sci/Tech to 30497
49851 Sci/Tech a 27686
59243 Sci/Tech of 26048
50993 Sci/Tech be 19238
97908 Sports the 56416
86396 Sports a 29398
98078 Sports to 27171
92067 Sports in 23187
93981 Sports of 20119
130423 World the 42548
117945 World a 32397
124106 World in 31243
130599 World to 30663
126243 World of 28728
  • Remove the stopwords from the tokens column in the df_train dataframe. Then, check with the most_used_words function: do the most used words say something about the class now?

    • Hint: once again, you can use the pandas.Series.apply function.

    The top 5 words per class should look like this after removing stopwords:

answer_df = answer_remove_stopwords(answer_df)
display(answer_most_used_words(answer_df))
class tokens count
26317 Business say 8879
12756 Business Reuters 6893
15719 Business US 5609
18895 Business company 5062
25086 Business price 4611
61365 Sci/Tech say 5536
58378 Sci/Tech new 5149
40328 Sci/Tech Microsoft 5041
51846 Sci/Tech company 3781
29300 Sci/Tech AP 3682
65192 Sports AP 6245
98155 Sports win 5492
90163 Sports game 3819
89768 Sports first 3696
96792 Sports team 3571
127374 World say 10299
106496 World Iraq 5760
98469 World AP 5757
112273 World Reuters 5406
123453 World kill 4545
def tokenize_and_lemmatize(df):
    """
    Tokenize and lemmatize the text in the dataset.

    Parameters
    ----------
    df : pandas.DataFrame
        The dataframe containing at least the text column.

    Returns
    -------
    pandas.DataFrame
        The dataframe with the added tokens column.
    """
    ###################################
    # Fill in your answer here
    return None
    ###################################


def most_used_words(df, token_col='tokens'):
    """
    Generate a dataframe with the 5 most used words per class, and their count.

    Parameters
    ----------
    df : pandas.DataFrame
        The dataframe containing at least the class and tokens columns.

    Returns
    -------
    pandas.DataFrame
        The dataframe with 5 rows per class, and an added 'count' column.
        The dataframe is sorted in ascending order on the class and in descending order on the count.
    """
    ###################################
    # Fill in your answer here
    return None
    ###################################


def remove_stopwords(df):
    """
    Remove stopwords from the tokens.

    Parameters
    ----------
    df : pandas.DataFrame
        The dataframe containing at least the tokens column,
        where the value in each row is a list of tokens.

    Returns
    -------
    pandas.DataFrame
        The dataframe with stopwords removed from the tokens column.
    """
    ###################################
    # Fill in your answer here
    return None
    ###################################
    

df_train = df_train  # Edit this.

The code below tests if all your functions combined match the expected output.

check_answer_df(most_used_words(df_train), answer_most_used_words(answer_df))
Test case 1 failed.
Your output is:
None
Expected output is:
class tokens count
26317 Business say 8879
12756 Business Reuters 6893
15719 Business US 5609
18895 Business company 5062
25086 Business price 4611
61365 Sci/Tech say 5536
58378 Sci/Tech new 5149
40328 Sci/Tech Microsoft 5041
51846 Sci/Tech company 3781
29300 Sci/Tech AP 3682
65192 Sports AP 6245
98155 Sports win 5492
90163 Sports game 3819
89768 Sports first 3696
96792 Sports team 3571
127374 World say 10299
106496 World Iraq 5760
98469 World AP 5757
112273 World Reuters 5406
123453 World kill 4545

Task 4: Another option: spaCy#

spaCy is another library used to perform various NLP tasks like tokenization, part-of-speech tagging, named entity recognition, dependency parsing, and much more. It provides pre-trained models for different languages and domains, which can be used as-is but also can be fine-tuned on a specific task or domain.

In an object-oriented way, spaCy can be thought of as a collection of classes and objects that work together to perform NLP tasks. Some of the important functions and classes in spaCy include:

  • nlp: The core function that provides the main functionality of spaCy. It is used to process text and create a Doc object.

  • Doc: A container for accessing linguistic annotations like tokens, part-of-speech tags, named entities, and dependency parse information. It is created by the nlp function and represents a processed document.

  • Token: An object representing a single token in a Doc object. It contains information like the token text, part-of-speech tag, lemma, embedding, and much more.

When a text is processed by spaCy, it is first passed to the nlp function, which uses the loaded model to tokenize the text and applies various linguistic annotations like part-of-speech tagging, named entity recognition, and dependency parsing in the background. The resulting annotations are stored in a Doc object, which can be accessed and manipulated using various methods and attributes. For example, the Doc object can be iterated over to access each Token object in the document.

# Load the small English model in spaCy.
# Disable Named Entity Recognition and the parser in the model pipeline since we're not using them.
# Check the following website for the spaCy NLP pipeline:
# - https://spacy.io/usage/processing-pipelines
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

# Process the text using spaCy.
doc = nlp(text)

# This becomes a spaCy Doc object, which prints nicely as the original string.
print(type(doc), doc)

# We can iterate over the tokens in the Doc, since it has already been tokenized underneath.
print(type(doc[0]))
for token in doc:
    print(token)
<class 'spacy.tokens.doc.Doc'> The quick brown fox jumped over the lazy dog. The cats couldn't wait to sleep all day.
<class 'spacy.tokens.token.Token'>
The
quick
brown
fox
jumped
over
the
lazy
dog
.
The
cats
could
n't
wait
to
sleep
all
day
.

Since a lot of processing has already been done, we can also directly access multiple attributes of the Token objects. For example, we can directly access the lemma of the token with Token.lemma_ and check if a token is a stop word with Token.is_stop.

print(doc[0].lemma_, type(doc[0].lemma_), doc[0].is_stop, type(doc[0].is_stop))
the <class 'str'> True <class 'bool'>

Here is the code to add a column with a Doc representation of the text column to the dataframe. Executing this cell takes several minutes, so we added a progress bar.

def add_spacy(df):
    """
    Add a column with the spaCy Doc objects.

    Parameters
    ----------
    df : pandas.DataFrame
        The dataframe containing at least the text column.

    Returns
    -------
    pandas.DataFrame
        The dataframe with the added doc column.
    """
    # Copy the dataframe to avoid editing the original one.
    df = df.copy(deep=True)

    # Use most of the available CPUs, leaving two free for other work.
    n_process = max(1, os.cpu_count() - 2)

    # Use multiple CPUs to speed up computing.
    df['doc'] = [doc for doc in tqdm(nlp.pipe(df['text'], n_process=n_process), total=df.shape[0])]

    return df


df_train = add_spacy(answer_df)
display(df_train)
class_idx class text tokens doc
0 3 Business Wall St. Bears Claw Back Into the Black (Reute... [Wall, St., Bears, Claw, Back, Black, (, Reute... (Wall, St., Bears, Claw, Back, Into, the, Blac...
1 3 Business Carlyle Looks Toward Commercial Aerospace (Reu... [Carlyle, Looks, Toward, Commercial, Aerospace... (Carlyle, Looks, Toward, Commercial, Aerospace...
2 3 Business Oil and Economy Cloud Stocks' Outlook (Reuters... [Oil, Economy, Cloud, Stocks, ', Outlook, (, R... (Oil, and, Economy, Cloud, Stocks, ', Outlook,...
3 3 Business Iraq Halts Oil Exports from Main Southern Pipe... [Iraq, Halts, Oil, Exports, Main, Southern, Pi... (Iraq, Halts, Oil, Exports, from, Main, Southe...
4 3 Business Oil prices soar to all-time record, posing new... [Oil, price, soar, all-time, record, ,, pose, ... (Oil, prices, soar, to, all, -, time, record, ...
... ... ... ... ... ...
119995 1 World Pakistan's Musharraf Says Won't Quit as Army C... [Pakistan, 's, Musharraf, Says, Wo, n't, Quit,... (Pakistan, 's, Musharraf, Says, Wo, n't, Quit,...
119996 2 Sports Renteria signing a top-shelf deal Red Sox gene... [Renteria, sign, top-shelf, deal, Red, Sox, ge... (Renteria, signing, a, top, -, shelf, deal, Re...
119997 2 Sports Saban not going to Dolphins yet The Miami Dolp... [Saban, go, Dolphins, yet, Miami, Dolphins, pu... (Saban, not, going, to, Dolphins, yet, The, Mi...
119998 2 Sports Today's NFL games PITTSBURGH at NY GIANTS Time... [Today, 's, NFL, game, PITTSBURGH, NY, GIANTS,... (Today, 's, NFL, games, PITTSBURGH, at, NY, GI...
119999 2 Sports Nets get Carter from Raptors INDIANAPOLIS -- A... [Nets, get, Carter, Raptors, INDIANAPOLIS, --,... (Nets, get, Carter, from, Raptors, INDIANAPOLI...

120000 rows × 5 columns

Assignment for Task 4#

Your assignment for this task is to write a function that does the following:

  • Add a spacy_tokens column to the df_train dataframe containing a list of lemmatized tokens as strings.

    • Hint: use a pandas.Series.apply operation on the doc column to accomplish this.

    Our goal is to have a dataframe that looks like the following:

answer_df = answer_spacy_tokens(df_train)
display(answer_df)
class_idx class text tokens doc spacy_tokens
0 3 Business Wall St. Bears Claw Back Into the Black (Reute... [Wall, St., Bears, Claw, Back, Black, (, Reute... (Wall, St., Bears, Claw, Back, Into, the, Blac... [Wall, St., Bears, Claw, Black, (, Reuters, ),...
1 3 Business Carlyle Looks Toward Commercial Aerospace (Reu... [Carlyle, Looks, Toward, Commercial, Aerospace... (Carlyle, Looks, Toward, Commercial, Aerospace... [Carlyle, look, Commercial, Aerospace, (, Reut...
2 3 Business Oil and Economy Cloud Stocks' Outlook (Reuters... [Oil, Economy, Cloud, Stocks, ', Outlook, (, R... (Oil, and, Economy, Cloud, Stocks, ', Outlook,... [oil, Economy, Cloud, Stocks, ', Outlook, (, R...
3 3 Business Iraq Halts Oil Exports from Main Southern Pipe... [Iraq, Halts, Oil, Exports, Main, Southern, Pi... (Iraq, Halts, Oil, Exports, from, Main, Southe... [Iraq, Halts, Oil, Exports, Main, Southern, Pi...
4 3 Business Oil prices soar to all-time record, posing new... [Oil, price, soar, all-time, record, ,, pose, ... (Oil, prices, soar, to, all, -, time, record, ... [oil, price, soar, -, time, record, ,, pose, n...
... ... ... ... ... ... ...
119995 1 World Pakistan's Musharraf Says Won't Quit as Army C... [Pakistan, 's, Musharraf, Says, Wo, n't, Quit,... (Pakistan, 's, Musharraf, Says, Wo, n't, Quit,... [Pakistan, Musharraf, say, will, quit, Army, C...
119996 2 Sports Renteria signing a top-shelf deal Red Sox gene... [Renteria, sign, top-shelf, deal, Red, Sox, ge... (Renteria, signing, a, top, -, shelf, deal, Re... [Renteria, sign, -, shelf, deal, Red, Sox, gen...
119997 2 Sports Saban not going to Dolphins yet The Miami Dolp... [Saban, go, Dolphins, yet, Miami, Dolphins, pu... (Saban, not, going, to, Dolphins, yet, The, Mi... [Saban, go, Dolphins, Miami, Dolphins, courtsh...
119998 2 Sports Today's NFL games PITTSBURGH at NY GIANTS Time... [Today, 's, NFL, game, PITTSBURGH, NY, GIANTS,... (Today, 's, NFL, games, PITTSBURGH, at, NY, GI... [today, NFL, game, PITTSBURGH, NY, giant, Time...
119999 2 Sports Nets get Carter from Raptors INDIANAPOLIS -- A... [Nets, get, Carter, Raptors, INDIANAPOLIS, --,... (Nets, get, Carter, from, Raptors, INDIANAPOLI... [net, Carter, Raptors, INDIANAPOLIS, --, -, St...

120000 rows × 6 columns

def spacy_tokens(df):
    """
    Add a column with a list of lemmatized tokens, without stopwords.

    Parameters
    ----------
    df : pandas.DataFrame
        The dataframe containing at least the doc column.

    Returns
    -------
    pandas.DataFrame
        The dataframe with the spacy_tokens column.
    """
    ###################################
    # Fill in your answer here
    return None
    ###################################


df_train = df_train  # Edit this.

The code below tests if the function matches the expected output.

check_answer_df(df_train, answer_df)
Test case 1 failed.
Your output is:
class_idx class text tokens doc
0 3 Business Wall St. Bears Claw Back Into the Black (Reute... [Wall, St., Bears, Claw, Back, Black, (, Reute... (Wall, St., Bears, Claw, Back, Into, the, Blac...
1 3 Business Carlyle Looks Toward Commercial Aerospace (Reu... [Carlyle, Looks, Toward, Commercial, Aerospace... (Carlyle, Looks, Toward, Commercial, Aerospace...
2 3 Business Oil and Economy Cloud Stocks' Outlook (Reuters... [Oil, Economy, Cloud, Stocks, ', Outlook, (, R... (Oil, and, Economy, Cloud, Stocks, ', Outlook,...
3 3 Business Iraq Halts Oil Exports from Main Southern Pipe... [Iraq, Halts, Oil, Exports, Main, Southern, Pi... (Iraq, Halts, Oil, Exports, from, Main, Southe...
4 3 Business Oil prices soar to all-time record, posing new... [Oil, price, soar, all-time, record, ,, pose, ... (Oil, prices, soar, to, all, -, time, record, ...
... ... ... ... ... ...
119995 1 World Pakistan's Musharraf Says Won't Quit as Army C... [Pakistan, 's, Musharraf, Says, Wo, n't, Quit,... (Pakistan, 's, Musharraf, Says, Wo, n't, Quit,...
119996 2 Sports Renteria signing a top-shelf deal Red Sox gene... [Renteria, sign, top-shelf, deal, Red, Sox, ge... (Renteria, signing, a, top, -, shelf, deal, Re...
119997 2 Sports Saban not going to Dolphins yet The Miami Dolp... [Saban, go, Dolphins, yet, Miami, Dolphins, pu... (Saban, not, going, to, Dolphins, yet, The, Mi...
119998 2 Sports Today's NFL games PITTSBURGH at NY GIANTS Time... [Today, 's, NFL, game, PITTSBURGH, NY, GIANTS,... (Today, 's, NFL, games, PITTSBURGH, at, NY, GI...
119999 2 Sports Nets get Carter from Raptors INDIANAPOLIS -- A... [Nets, get, Carter, Raptors, INDIANAPOLIS, --,... (Nets, get, Carter, from, Raptors, INDIANAPOLI...

120000 rows × 5 columns

Expected output is:
class_idx class text tokens doc spacy_tokens
0 3 Business Wall St. Bears Claw Back Into the Black (Reute... [Wall, St., Bears, Claw, Back, Black, (, Reute... (Wall, St., Bears, Claw, Back, Into, the, Blac... [Wall, St., Bears, Claw, Black, (, Reuters, ),...
1 3 Business Carlyle Looks Toward Commercial Aerospace (Reu... [Carlyle, Looks, Toward, Commercial, Aerospace... (Carlyle, Looks, Toward, Commercial, Aerospace... [Carlyle, look, Commercial, Aerospace, (, Reut...
2 3 Business Oil and Economy Cloud Stocks' Outlook (Reuters... [Oil, Economy, Cloud, Stocks, ', Outlook, (, R... (Oil, and, Economy, Cloud, Stocks, ', Outlook,... [oil, Economy, Cloud, Stocks, ', Outlook, (, R...
3 3 Business Iraq Halts Oil Exports from Main Southern Pipe... [Iraq, Halts, Oil, Exports, Main, Southern, Pi... (Iraq, Halts, Oil, Exports, from, Main, Southe... [Iraq, Halts, Oil, Exports, Main, Southern, Pi...
4 3 Business Oil prices soar to all-time record, posing new... [Oil, price, soar, all-time, record, ,, pose, ... (Oil, prices, soar, to, all, -, time, record, ... [oil, price, soar, -, time, record, ,, pose, n...
... ... ... ... ... ... ...
119995 1 World Pakistan's Musharraf Says Won't Quit as Army C... [Pakistan, 's, Musharraf, Says, Wo, n't, Quit,... (Pakistan, 's, Musharraf, Says, Wo, n't, Quit,... [Pakistan, Musharraf, say, will, quit, Army, C...
119996 2 Sports Renteria signing a top-shelf deal Red Sox gene... [Renteria, sign, top-shelf, deal, Red, Sox, ge... (Renteria, signing, a, top, -, shelf, deal, Re... [Renteria, sign, -, shelf, deal, Red, Sox, gen...
119997 2 Sports Saban not going to Dolphins yet The Miami Dolp... [Saban, go, Dolphins, yet, Miami, Dolphins, pu... (Saban, not, going, to, Dolphins, yet, The, Mi... [Saban, go, Dolphins, Miami, Dolphins, courtsh...
119998 2 Sports Today's NFL games PITTSBURGH at NY GIANTS Time... [Today, 's, NFL, game, PITTSBURGH, NY, GIANTS,... (Today, 's, NFL, games, PITTSBURGH, at, NY, GI... [today, NFL, game, PITTSBURGH, NY, giant, Time...
119999 2 Sports Nets get Carter from Raptors INDIANAPOLIS -- A... [Nets, get, Carter, Raptors, INDIANAPOLIS, --,... (Nets, get, Carter, from, Raptors, INDIANAPOLI... [net, Carter, Raptors, INDIANAPOLIS, --, -, St...

120000 rows × 6 columns

We use the answer version of the most_used_words function to again display the top 5 words per class in the dataset. Do you see some differences between the lemmatized tokens obtained from NLTK and spaCy?

display(answer_most_used_words(answer_df, 'spacy_tokens'))
class spacy_tokens count
24965 Business say 8927
11332 Business Reuters 6885
22746 Business oil 5402
17106 Business company 5107
23716 Business price 4951
55144 Sci/Tech new 5549
37848 Sci/Tech Microsoft 5073
58225 Sci/Tech say 5023
48269 Sci/Tech company 3847
28075 Sci/Tech AP 3692
62256 Sports AP 6262
94335 Sports win 5995
85416 Sports game 4459
92830 Sports team 3738
91307 Sports season 3685
121838 World say 10174
94687 World AP 5786
101753 World Iraq 5783
106953 World Reuters 5414
117760 World kill 4967

Task 5: Unsupervised Learning - Topic Modelling#

Topic modelling is a technique used in NLP that aims to identify the underlying topics or themes in a collection of texts. One way to perform topic modelling is using the probabilistic model Latent Dirichlet Allocation (LDA).

LDA assumes that each document in a collection is a mixture of different topics, and each topic is a probability distribution over a set of words. The model then infers the underlying topic distribution for each document in the collection and the word distribution for each topic. LDA is trained using an iterative algorithm that maximizes the likelihood of observing the given documents.

To use LDA, we need to represent the documents as a bag of words, where the order of the words is ignored and only the frequency of each word in the document is considered. This bag-of-words representation allows us to represent each document as a vector of word frequencies, which can be used as input to the LDA algorithm. Fitting LDA can take a while on a dataset of this size.

# Define the number of topics to model with LDA.
num_topics = 4

# Convert preprocessed text to bag-of-words representation using CountVectorizer.
vectorizer = CountVectorizer(max_features=50000)

# fit_transform expects string input (or several extra arguments and functions),
# so we simply join the tokens back into one string per document.
X = vectorizer.fit_transform(answer_df['spacy_tokens'].apply(lambda x: ' '.join(x)).values)

# Fit LDA to the feature matrix. Verbose so we know what iteration we're on.
lda = LatentDirichletAllocation(n_components=num_topics, max_iter=10, random_state=42, verbose=True)
lda.fit(X)

# Extract the topic proportions for each document.
doc_topic_proportions = lda.transform(X)
iteration: 1 of max_iter: 10
iteration: 2 of max_iter: 10
iteration: 3 of max_iter: 10
iteration: 4 of max_iter: 10
iteration: 5 of max_iter: 10
iteration: 6 of max_iter: 10
iteration: 7 of max_iter: 10
iteration: 8 of max_iter: 10
iteration: 9 of max_iter: 10
iteration: 10 of max_iter: 10

Using this function, we can take a look at the most important words per topic. Do you see some similarities with the most occurring words per class after stopword removal?

def n_top_wordlist(model, features, ntopwords=5):
    """
    Get the ntopwords most important words per LDA topic.
    """
    output = {}
    for topic_idx, topic in enumerate(model.components_):
        output[topic_idx] = [features[i] for i in topic.argsort()[:-ntopwords - 1:-1]]
    return output


# Get the words from the CountVectorizer.
tf_feature_names = vectorizer.get_feature_names_out()

display(n_top_wordlist(lda, tf_feature_names))
{0: ['39', 'win', 'ap', 'game', 'team'],
 1: ['39', 'new', 'microsoft', 'say', 'company'],
 2: ['reuters', 'say', '39', 'oil', 'new'],
 3: ['say', '39', 'ap', 'reuters', 'president']}

Evaluation#

Adjusted Mutual Information (AMI) and Adjusted Rand Index (ARI) are two metrics used to evaluate the performance of clustering algorithms.

AMI is a measure that takes into account the possibility of two random clusterings appearing to be similar by chance. It is calculated as the difference between the Mutual Information (MI) of the two clusterings and the expected MI, divided by the average entropy of the two clusterings minus the expected MI. An AMI close to 0 indicates no agreement beyond chance between the two clusterings, and an AMI of 1 indicates identical clusterings (it can become slightly negative for labelings that agree less than expected by chance).

The Rand Index (RI) is a measure that counts the number of pairs of samples that are assigned to the same or different clusters in both the predicted and true clusterings. The raw RI score is then adjusted for chance into the ARI score using a scheme similar to that of AMI. For ARI a score of 0 indicates random labeling and 1 indicates perfect agreement. The ARI is bounded below by -0.5 for very large differences in labeling.
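
To get a feel for these metrics, here is a small, made-up example using the scikit-learn functions imported above. Both scores ignore the actual label values and only compare the groupings, so relabelled but otherwise identical clusterings still score 1:

# Two toy clusterings: the groups are identical, only the label values differ.
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [1, 1, 0, 0, 2, 2]

print(adjusted_mutual_info_score(toy_true, toy_pred))  # 1.0
print(adjusted_rand_score(toy_true, toy_pred))         # 1.0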

Assignment for Task 5#

Your assignment for this task is to write a function that does the following:

  • The doc_topic_proportions array contains, for every document, the proportion with which it belongs to each topic. For every document, get the topic in which it has the largest proportion. Afterwards, look at the AMI and ARI scores. Can you improve the scores by modelling more topics, using a different set of tokens, or running more iterations?

    • Hint: use the numpy.argmax function.

    Our goal is to get an array that looks like the following:

answer_topic_most = answer_largest_proportion(doc_topic_proportions)
print(answer_topic_most, answer_topic_most.shape)
[2 1 2 ... 0 0 0] (120000,)
def largest_proportion(arr):
    """
    For every row, get the column number where it has the largest value.

    Parameters
    ----------
    arr : numpy.array
        The array with the number of topics as the number of columns
        and the number of documents as the number of rows.
        Every row should sum up to 1.

    Returns
    -------
    numpy.array
        The 1-dimensional array containing the label of the topic
        in which the document has the largest proportion.
    """
    ###################################
    # Fill in your answer here
    return None
    ###################################

The code below tests if the function matches the expected output.

topic_most = largest_proportion(doc_topic_proportions)

check_answer_np(topic_most, answer_topic_most)
Test case 1 failed.
Your output is:
None
Expected output is:
[2 1 2 ... 0 0 0]
ami_score = adjusted_mutual_info_score(df_train['class'], answer_topic_most)
ari_score = adjusted_rand_score(df_train['class'], answer_topic_most)

print(f"Adjusted mutual information score: {ami_score:.2f}")
print(f"Adjusted rand score: {ari_score:.2f}")
Adjusted mutual information score: 0.52
Adjusted rand score: 0.54

Do some topics get (way) more documents assigned to them than others? Let’s take a look.

unique, counts = np.unique(answer_topic_most, return_counts=True)

print(dict(zip(unique, counts)))
{0: 30630, 1: 30442, 2: 27348, 3: 31580}

Task 6: Word Embeddings#

Word embeddings represent words as vectors in a high-dimensional space. The key idea behind word embeddings is that words with similar meanings tend to appear in similar contexts, and therefore their vector representations should be close together in this high-dimensional space. Word embeddings have been widely used in various NLP tasks such as sentiment analysis, machine translation, and information retrieval.

There are several techniques to generate word embeddings, but one of the most popular methods is the Word2Vec algorithm, which is based on a neural network architecture. Word2Vec learns embeddings either by predicting a word from its surrounding context (the continuous bag-of-words model) or by predicting the context from a word (the skip-gram model). The output of the network is a set of word vectors that can be used as embeddings.
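
As a side note, gensim exposes the choice between these two architectures through the sg parameter of Word2Vec (0 for CBOW, which is the default, and 1 for skip-gram). A minimal sketch on made-up toy sentences, just to show the parameter:

# Toy example, separate from the model we train below.
toy_sentences = [['the', 'cat', 'sleeps'], ['the', 'dog', 'sleeps']]

# sg=1 selects the skip-gram architecture; sg=0 (the default) selects CBOW.
toy_model = Word2Vec(toy_sentences, vector_size=8, min_count=1, sg=1)
print(toy_model.wv['cat'].shape)  # (8,)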

We can train a Word2Vec model ourselves. Keep in mind that, later on, it would be a problem if we had no embeddings for words that only appear in the test set, so let's first apply the familiar preprocessing steps to the test set:

# Reformat df_test.
df_test = reformat_data(df_test)

# NLTK preprocessing.
df_test = answer_remove_stopwords(answer_tokenize_and_lemmatize(df_test))

# spaCy preprocessing.
df_test = answer_spacy_tokens(add_spacy(df_test))

display(df_test)
class_idx class text tokens doc spacy_tokens
0 3 Business Fears for T N pension after talks Unions repre... [Fears, N, pension, talk, Unions, represent, w... (Fears, for, T, N, pension, after, talks, Unio... [fear, t, N, pension, talk, union, represent, ...
1 4 Sci/Tech The Race is On: Second Private Team Sets Launc... [Race, :, Second, Private, Team, Sets, Launch,... (The, Race, is, On, :, Second, Private, Team, ... [Race, :, Second, Private, Team, Sets, Launch,...
2 4 Sci/Tech Ky. Company Wins Grant to Study Peptides (AP) ... [Ky., Company, Wins, Grant, Study, Peptides, (... (Ky., Company, Wins, Grant, to, Study, Peptide... [Ky., Company, Wins, Grant, Study, Peptides, (...
3 4 Sci/Tech Prediction Unit Helps Forecast Wildfires (AP) ... [Prediction, Unit, Helps, Forecast, Wildfires,... (Prediction, Unit, Helps, Forecast, Wildfires,... [Prediction, Unit, help, Forecast, Wildfires, ...
4 4 Sci/Tech Calif. Aims to Limit Farm-Related Smog (AP) AP... [Calif, ., Aims, Limit, Farm-Related, Smog, (,... (Calif., Aims, to, Limit, Farm, -, Related, Sm... [Calif., aim, Limit, Farm, -, relate, Smog, (,...
... ... ... ... ... ... ...
7595 1 World Around the world Ukrainian presidential candid... [Around, world, Ukrainian, presidential, candi... (Around, the, world, Ukrainian, presidential, ... [world, ukrainian, presidential, candidate, Vi...
7596 2 Sports Void is filled with Clement With the supply of... [Void, fill, Clement, supply, attractive, pitc... (Void, is, filled, with, Clement, With, the, s... [Void, fill, Clement, supply, attractive, pitc...
7597 2 Sports Martinez leaves bitter Like Roger Clemens did ... [Martinez, leave, bitter, Like, Roger, Clemens... (Martinez, leaves, bitter, Like, Roger, Clemen... [Martinez, leave, bitter, like, Roger, Clemens...
7598 3 Business 5 of arthritis patients in Singapore take Bext... [5, arthritis, patient, Singapore, take, Bextr... (5, of, arthritis, patients, in, Singapore, ta... [5, arthritis, patient, Singapore, Bextra, Cel...
7599 3 Business EBay gets into rentals EBay plans to buy the a... [EBay, get, rental, EBay, plan, buy, apartment... (EBay, gets, into, rentals, EBay, plans, to, b... [EBay, get, rental, EBay, plan, buy, apartment...

7600 rows × 6 columns

To train the model on the complete vocabulary, we combine the tokens columns of both dataframes into one series and call the Word2Vec function.

# Get all tokens into one series.
tokens_both = pd.concat([df_train['tokens'], df_test['tokens']])

# Train a Word2Vec model on the NLTK tokens.
w2v_model = Word2Vec(tokens_both.values, vector_size=96, min_count=1)
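
As a quick sanity check of the trained model, we can ask for the nearest neighbours of a word in the embedding space. The exact neighbours and similarity scores vary per training run, so treat the output only as an indication:

# Inspect the 5 most similar tokens to 'oil' according to the trained embeddings.
print(w2v_model.wv.most_similar('oil', topn=5))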

To obtain the embedding of a word, we can use the Word2Vec.wv[word] syntax. To stack multiple vectors into a 2D matrix (one vector per row), we can call numpy.vstack.

print(np.vstack([w2v_model.wv[word] for word in ['rain', 'cat', 'dog']]))
[[-2.2940760e+00 -4.8420104e-01 -4.4042167e-01  6.8525958e-01
  -1.4821498e+00  2.4811094e+00 -9.9135214e-01  3.2920041e-05
   4.2954880e-01  1.3281231e+00 -1.4077977e+00 -3.8684130e-01
  -7.6716363e-02 -1.1165497e+00  9.3549109e-01  1.0929279e+00
   2.5643912e-01  4.6144328e-01  7.0888650e-01  1.1422478e+00
  -7.2016567e-01 -1.5233663e+00 -3.0344512e+00 -2.2352202e+00
   5.0477499e-01 -1.3620797e-01  1.1560580e-01 -8.7251730e-02
   1.3159508e+00 -3.0589628e-01  1.8796422e+00  4.5079884e-01
   1.9805964e+00  7.5381422e-01 -1.4104112e+00  2.2851133e-01
   2.0000076e-01  6.1606807e-01 -2.5055645e+00  1.3439629e+00
   1.4786874e-01  5.2414012e-01 -7.3734067e-02  5.8897328e-01
  -2.9820289e-02 -7.5076813e-01 -7.3534161e-01 -2.8183448e-01
  -1.7987163e+00  1.0794261e+00 -2.5938660e-01  6.9268233e-01
   1.0267434e+00  3.8907737e-01  1.7781268e+00 -1.0220027e+00
  -5.6030327e-01  1.7142199e+00 -1.0593474e+00  9.3658783e-02
   1.5736760e+00  1.4056033e+00 -1.0150774e+00  3.0287990e-01
   1.5140053e+00 -8.6504275e-01 -1.1736679e+00 -1.6786666e+00
  -1.0144037e+00 -1.3584304e+00 -9.1850632e-01  1.3619289e+00
   1.2904971e+00 -1.6977031e+00 -7.3090231e-01 -3.4742847e-01
   6.7464776e-02  8.9619391e-02 -1.1155491e+00 -6.6920400e-01
  -1.3479675e-01  6.5459555e-01 -5.8369678e-01 -1.0921571e+00
   4.1924748e-01  3.1124693e-01 -1.3012956e+00  2.1808662e-01
   6.9067222e-01  1.6104497e+00  1.3644813e+00 -3.1946927e-02
  -7.4468619e-01 -5.9339243e-01  4.1335011e-01  7.3781967e-01]
 [-1.6707586e-01 -1.7751616e-01 -1.5991940e-01 -1.9989851e-01
  -2.7424154e-01  3.6543764e-02  2.1962922e-02 -2.4741021e-01
  -4.4054174e-01 -1.8425202e-01  1.1129478e-01  2.0092180e-01
  -8.3667494e-02 -5.8217788e-01 -6.6241229e-01  4.6616954e-01
  -3.7988979e-01 -7.2291568e-02 -3.9707631e-01  2.8176224e-01
   4.9108905e-01 -2.1542142e-01  3.4716297e-02 -2.0940323e-01
   2.5171158e-01 -1.9280653e-01 -5.5869442e-01 -2.5533444e-01
  -2.0840354e-01 -1.1751265e-01 -4.3067787e-02  5.4453087e-01
  -1.4280096e-01  1.8038911e-01  2.2600693e-01 -1.9326421e-02
   2.5109005e-01 -7.1796763e-01  6.8420961e-02  4.5902830e-01
   8.8544242e-02 -1.1893602e-01  6.9759995e-01  1.7631820e-01
  -1.6763940e-01 -2.9361406e-01 -4.1476540e-02 -1.7099978e-02
  -3.8931394e-01  4.3411884e-01  3.8996592e-02  1.6359149e-01
  -1.8415020e-01 -2.8604241e-02  3.6487141e-01 -3.8500302e-02
   2.6624441e-01  2.9064500e-01 -2.8262255e-01  1.2855190e-01
   3.8448983e-01 -2.5975108e-01 -3.1233981e-01  1.6406569e-01
   3.4801754e-01 -4.9045366e-01  1.7986092e-01  6.9124356e-02
  -3.0077201e-01 -2.3007296e-01  2.5923157e-01 -1.6732115e-02
  -3.8551494e-01  1.1382518e-01 -3.3340454e-01 -6.4890683e-02
   5.5055684e-01  5.0011641e-01 -2.8933269e-01 -3.8424399e-02
   6.6647522e-02 -4.9694654e-02 -1.7739089e-01 -3.3246987e-02
   3.2536820e-01  3.2836625e-01 -1.9247569e-01 -2.1271285e-02
  -1.0841837e-01  1.4557414e-01  2.8648019e-01 -5.0694276e-02
  -5.6086332e-02  1.8716384e-01  5.0777632e-01  3.0855319e-01]
 [-5.3797096e-01 -4.1078672e-01 -4.6644393e-01 -1.7776264e-01
  -6.1859310e-01  1.3632537e-01  8.6262353e-02 -4.7705841e-01
  -4.5533961e-01 -1.8748164e-02  5.8455920e-01 -4.3029517e-02
  -1.8955900e-01 -7.9340595e-01 -1.1365076e+00  7.3272949e-01
  -2.8275236e-01 -1.8743893e-02 -2.1410777e-01  1.3195230e-02
   4.7014341e-01 -9.3369074e-02 -3.8174710e-01 -3.7527493e-01
   1.2884773e-01 -2.0903893e-01 -4.8333859e-01 -3.7088567e-01
  -1.7931862e-02 -6.5597802e-02  1.8665287e-01  4.3456262e-01
   1.5434964e-02  2.8702241e-01  3.7848458e-01  4.7424236e-01
   5.1574731e-01 -8.4790844e-01 -6.2235701e-01  5.6008190e-01
   1.5852283e-01  2.0123389e-01  5.5801046e-01  6.6353047e-01
  -2.4250355e-01 -5.0138640e-01 -4.9429446e-01 -1.3960998e-01
  -6.4502162e-01  8.8502163e-01 -4.5185152e-01  1.0153203e+00
  -2.1195376e-01 -1.9655213e-01  5.2993536e-01 -3.8079235e-01
   2.0738366e-01  4.7517672e-01 -1.0797164e+00 -1.9228728e-01
   3.9164144e-01 -1.4024313e-01 -3.4790218e-01  3.5092226e-01
   3.5909703e-01 -1.9561702e-01 -3.6081094e-01  2.2829367e-01
  -9.2727669e-02  4.3977797e-02  3.8294426e-01  2.7670828e-01
  -9.1451295e-02  2.8349352e-01  8.1916869e-02 -1.2423317e-01
   2.2019100e-01  2.3845011e-01  2.2413979e-01 -2.7076536e-01
  -4.0909567e-01  1.2636584e-01  1.8710716e-01  6.3947570e-01
   2.6947409e-01  4.5367742e-01 -7.8334993e-01  3.4023982e-02
   2.5921276e-01  2.1218823e-01  4.7835699e-01 -2.5282547e-01
   1.3911194e-02 -4.7073115e-02  1.1101334e+00  6.6093242e-01]]

The spaCy model we used has a Tok2Vec component in its pipeline, so we can directly access the 2D matrix of all word vectors of a document through the Doc.tensor attribute. Keep in mind that this still contains the embeddings of the stopwords.

print(doc.tensor)
[[ 0.30832282 -0.4600038  -0.6966157  ... -0.03586689 -0.8444165
   0.33138227]
 [-0.16609877  0.26174432 -0.34486908 ...  0.11861527 -0.11567482
  -0.9424331 ]
 [-0.1972521  -0.06649795 -0.5903488  ... -0.6296763   0.11471006
  -0.27722898]
 ...
 [-0.7060187  -0.5229768   0.8356328  ... -0.76241744 -1.0825483
  -0.1288386 ]
 [ 0.28227934  0.8741193   1.4112176  ...  1.3289344   0.23879066
   0.7562135 ]
 [-0.8391066   1.0865963   2.1023006  ... -0.42140386  0.18943703
   1.1026181 ]]

Assignment for Task 6#

Your task (which is your assignment) is to write a function to do the following:

  • First, sample 10% of the data from both datasets; we are not going to use all the data for the neural network.

    • Hint: use the pandas.DataFrame.sample function to sample a fraction of the data. Specify a random_state value so you always get the same rows from the dataframe.

  • Add a tensor column to both the train and test dataframes. The column should hold one array per row, containing all the word embedding vectors as columns. You can choose whether to use vectors from our new model or the ones from spaCy.

  • Determine the largest number of columns among the arrays in the tensor column, across both datasets.

    • Hint: use the numpy.ndarray.shape attribute to see the dimensions of an array.

    • Hint: use the pandas.Series.max function to determine the largest item in a series.

  • Pad all arrays in the tensor column with columns of zeros at the end so that they are equal in size to the biggest tensor. This way, all inputs for the neural network have the same size (a short generic sketch of the sampling and padding operations follows this list).

    • Hint: use the numpy.pad function to pad an array.
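
To make the hints more concrete, here is a small, generic sketch of the two key operations (sampling and zero-padding). The dataframe, fraction, and target width are made up for illustration; this is not the assignment answer.

# Generic illustration of sampling and padding (not the assignment answer).
toy_df = pd.DataFrame({"text": ["a", "b", "c", "d", "e"]})

# Reproducibly sample a fraction of the rows.
toy_sample = toy_df.sample(frac=0.4, random_state=42)

# Pad a (4, 3) array with columns of zeros on the right until it has 5 columns.
toy_arr = np.ones((4, 3))
target_cols = 5
toy_padded = np.pad(toy_arr, ((0, 0), (0, target_cols - toy_arr.shape[1])))
print(toy_padded.shape)  # (4, 5)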

After the function, our df_train could look like the following:

answer_df_train, answer_df_test = answer_add_padded_tensors(answer_df, df_test)
display(answer_df_train)
class_idx class text tokens doc spacy_tokens tensor
71787 3 Business BBC set for major shake-up, claims newspaper L... [BBC, set, major, shake-up, ,, claim, newspape... (BBC, set, for, major, shake, -, up, ,, claims... [BBC, set, major, shake, -, ,, claim, newspape... [[0.15803973, 0.75629187, -0.17498428, -0.2533...
67218 3 Business Marsh averts cash crunch Embattled insurance b... [Marsh, avert, cash, crunch, Embattled, insura... (Marsh, averts, cash, crunch, Embattled, insur... [Marsh, avert, cash, crunch, embattle, insuran... [[-0.19821277, -0.14202696, -0.73497164, -0.32...
54066 2 Sports Jeter, Yankees Look to Take Control (AP) AP - ... [Jeter, ,, Yankees, Look, Take, Control, (, AP... (Jeter, ,, Yankees, Look, to, Take, Control, (... [Jeter, ,, Yankees, look, Control, (, AP, ), A... [[0.35961235, 0.85889757, -0.8120774, -1.26034...
7168 4 Sci/Tech Flying the Sun to Safety When the Genesis caps... [Flying, Sun, Safety, Genesis, capsule, come, ... (Flying, the, Sun, to, Safety, When, the, Gene... [fly, Sun, safety, Genesis, capsule, come, Ear... [[-0.76188123, -0.41965377, -0.75621116, -0.21...
29618 3 Business Stocks Seen Flat as Nortel and Oil Weigh NEW ... [Stocks, Seen, Flat, Nortel, Oil, Weigh, NEW, ... (Stocks, Seen, Flat, as, Nortel, and, Oil, Wei... [stock, see, Flat, Nortel, Oil, Weigh, , NEW,... [[-0.3134395, -0.066423565, 0.8806261, 0.31077...
... ... ... ... ... ... ... ...
27162 2 Sports Oakland Athletics Team Report - September 14 (... [Oakland, Athletics, Team, Report, -, Septembe... (Oakland, Athletics, Team, Report, -, Septembe... [Oakland, Athletics, Team, Report, -, Septembe... [[1.7174376, -0.022900194, -0.041351043, -0.71...
82268 4 Sci/Tech Telcos' convergence strategies diverge CANNES ... [Telcos, ', convergence, strategy, diverge, CA... (Telcos, ', convergence, strategies, diverge, ... [Telcos, ', convergence, strategy, diverge, CA... [[1.2340894, 0.39759836, 0.4969958, -0.0863048...
7765 4 Sci/Tech Motive aims to head off system glitches Three ... [Motive, aim, head, system, glitch, Three, new... (Motive, aims, to, head, off, system, glitches... [motive, aim, head, system, glitche, new, soft... [[-0.5199754, -0.008618206, -0.70200384, -0.55...
25871 3 Business Campbell #39;s 4th-Qtr Net Drops 20 on Higher ... [Campbell, #, 39, ;, 4th-Qtr, Net, Drops, 20, ... (Campbell, #, 39;s, 4th, -, Qtr, Net, Drops, 2... [Campbell, #, 39;s, 4th, -, Qtr, Net, Drops, 2... [[0.40166336, -0.34546575, -0.52296686, 0.4144...
57234 4 Sci/Tech MySQL to make use of Microsoft code The code, ... [MySQL, make, use, Microsoft, code, code, ,, o... (MySQL, to, make, use, of, Microsoft, code, Th... [mysql, use, Microsoft, code, code, ,, open, -... [[-0.31237996, -0.11844462, 0.38788873, 0.2587...

12000 rows × 7 columns

def add_padded_tensors(df1, df2):
    """
    First, sample 10% of both datasets and only use that.
    Then, add a tensor column to the dataframes, with every tensor having the same dimensions.

    Parameters
    ----------
    df1 : pandas.DataFrame
        The first dataframe containing at least the tokens or doc column.
    df2 : pandas.DataFrame
        The second dataframe containing at least the tokens or doc column.

    Returns
    -------
    tuple[pandas.DataFrame]
        The sampled dataframes with the added tensor column.
    """
    ###################################
    # Fill in your answer here
    return None
    ###################################


df_train, df_test = df_train, df_test  # Edit this.

The code below tests if the function matches the expected output.

try:
    assert (df_train['tensor'].apply(lambda x: x.shape).unique() ==
            df_test['tensor'].apply(lambda x: x.shape).unique())
    print("Test case 1 passed.")
except Exception:
    print("Test case 1 failed. Not all tensor sizes are equal.")

try:
    assert df_test.shape[0] == 760
    print("Test case 1 passed.")
except Exception:
    print("Test case 2 failed. The test dataframe does not have the correct size.")
Test case 1 failed. Not all tensor sizes are equal.
Test case 2 failed. The test dataframe does not have the correct size.

Task 7: Supervised Learning - Topic Classification#

Topic classification is a task in NLP that involves automatically assigning a given text document to one or more predefined categories or topics. This task is essential for various applications, such as document organization, search engines, sentiment analysis, and more.

In recent years, deep learning models have shown remarkable performance in various NLP tasks, including topic classification. We will explore a neural network-based approach for topic classification using the PyTorch framework. PyTorch provides an efficient way to build and train neural networks with a high degree of flexibility and ease of use.

Our neural network will take the embedding representation of the document as input and predict the corresponding topic using a softmax output layer. We will evaluate the performance of our model using various metrics such as accuracy, precision, recall, and F1-score.

The following code demonstrates how to implement a neural network for topic classification in PyTorch. First, let's do some more preparation on our inputs by turning them into PyTorch tensors.

# Use dataframes from previous assignment. These use the spaCy tensors.
df_train = answer_df_train
df_test = answer_df_test

# Transform inputs into PyTorch tensors.
input_train = torch.from_numpy(np.stack(df_train['tensor']))
input_test = torch.from_numpy(np.stack(df_test['tensor']))

# Get the labels, move to 0-indexed instead of 1-indexed.
train_labels = torch.from_numpy(df_train['class_idx'].values) - 1
test_labels = torch.from_numpy(df_test['class_idx'].values) - 1

# One-hot encode labels for training.
train_target = torch.zeros((len(train_labels), 4))
train_target = train_target.scatter_(1, train_labels.unsqueeze(1), 1).unsqueeze(1)
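
The scatter_ call above is a compact way to one-hot encode the labels; on a toy tensor with made-up labels it works as follows:

# Small illustration of one-hot encoding with scatter_ (made-up labels).
toy_labels = torch.tensor([0, 2, 1])
toy_onehot = torch.zeros((3, 4)).scatter_(1, toy_labels.unsqueeze(1), 1)
print(toy_onehot)
# tensor([[1., 0., 0., 0.],
#         [0., 0., 1., 0.],
#         [0., 1., 0., 0.]])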

Then, it's time to define our network. The neural net consists of three fully connected layers (fc1, fc2, and fc3) with ReLU activations (relu) between them. We flatten the input tensor using view before passing it through the fully connected layers. Finally, we apply the softmax activation function (softmax) to the output tensor to obtain the predicted probabilities for each class.

class TopicClassifier(nn.Module):
    def __init__(self, input_width, input_length, output_size):
        super(TopicClassifier, self).__init__()
        self.input_width = input_width
        self.input_length = input_length
        self.output_size = output_size

        self.fc1 = nn.Linear(input_width * input_length, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, output_size)

        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        # Flatten the input tensor.
        x = x.view(-1, self.input_width * self.input_length)

        # Pass through the fully connected layers with ReLU activation.
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.relu(x)
        x = self.fc3(x)

        # Apply softmax activation to the output.
        x = self.softmax(x)
        return x
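
As a quick sanity check (with arbitrary, made-up dimensions), we can pass a random tensor through an untrained instance and confirm that each output row is a probability distribution over four classes:

# Sanity check with arbitrary dimensions (not part of the actual pipeline).
dummy_model = TopicClassifier(96, 60, 4)   # 96 x 60 input, 4 output classes.
dummy_input = torch.randn(2, 96, 60)       # A batch of 2 fake documents.
dummy_output = dummy_model(dummy_input)
print(dummy_output.shape)                  # torch.Size([2, 4])
print(dummy_output.sum(dim=1))             # Each row sums to ~1 due to softmax.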

Now it's time to train our network. This may take a while, but the current loss will be printed after every epoch. If you want the code to run faster, you can also put this notebook on Google Colab and use its provided GPU to speed up the computation.

# Define parameters.
n_classes = len(train_labels.unique())
input_size = input_train.shape[1:]
num_epochs = 5
lr = 0.001

# Define model, loss function and optimizer.
model = TopicClassifier(*input_size, output_size=n_classes)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=lr)

# Training loop.
for epoch in range(num_epochs):
    for i, (inputs, labels) in enumerate(zip(input_train, train_target)):
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
    print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}")
Epoch [1/5], Loss: 1.3646
Epoch [2/5], Loss: 1.2608
Epoch [3/5], Loss: 1.1472
Epoch [4/5], Loss: 0.9604
Epoch [5/5], Loss: 0.8177

Assignment for Task 7 (Optional)#

Your task (which is your assignment) is to do the following:

  • Use the code below to evaluate the neural network.

  • Generate a confusion matrix with sklearn.metrics.confusion_matrix (it is already imported, so you can call confusion_matrix directly).

  • Plot the confusion matrix using seaborn.heatmap (seaborn is usually imported as sns). Set the annot argument to True and the xticklabels and yticklabels to the labels list. A minimal sketch of one possible approach is shown after the code cell below.

  • Also, take the time to evaluate the train set. Is there a notable difference in accuracy, precision, recall, and F1-score between the train and test sets?

# Evaluate the neural net on the test set.
model.eval()

# Sample from the model.
with torch.no_grad():
    test_outputs = model(input_test)
    # Reuse our previous function to get the label with the biggest probability.
    test_pred = answer_largest_proportion(test_outputs.detach())

# Set model back to training mode.
model.train()

labels = ['World', 'Sports', 'Business', 'Sci/Tech']

###################################
# Fill in your answer here
###################################
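
If you want a starting point, here is a minimal sketch of one possible approach (not the official answer). It assumes test_pred holds the 0-indexed predicted classes as a NumPy array, as produced by the code above.

# One possible evaluation sketch (assumes test_pred is a NumPy array of
# 0-indexed predicted classes).
y_true = test_labels.numpy()
y_pred = np.asarray(test_pred)

# Overall accuracy on the test set.
print("Accuracy:", (y_true == y_pred).mean())

# Confusion matrix plotted as a heatmap.
cm = confusion_matrix(y_true, y_pred)
sns.heatmap(cm, annot=True, fmt="d", xticklabels=labels, yticklabels=labels)
plt.xlabel("Predicted")
plt.ylabel("True")
plt.show()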

Optional Assignment / Takeaways#

If you do not feel done with text data yet, there’s always more to do. You can still experiment with the number of epochs, learning rate, vector size, optimizer, neural network layers, regularization and so much more. Even during the preprocessing, we could have done some things differently, like making everything lowercase and removing punctuation. Be aware that every choice you make along the way trickles down into your pipeline and can have some effect on your results.
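
For instance, one quick experiment (with arbitrarily chosen values) is to retrain the same network with the Adam optimizer, a smaller learning rate, and more epochs, and then compare the results:

# One possible experiment: same architecture, different training setup.
# The values below are arbitrary starting points, not recommendations.
num_epochs = 10
model = TopicClassifier(*input_size, output_size=n_classes)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# ...then rerun the training loop from Task 7 with these settings.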


1

Credit: this teaching material is created by Robert van Straten under the supervision of Yen-Chia Hsu.