Tutorial (Text Data Processing)#
(Last updated: Mar 7, 2023)
This tutorial will familiarize you with the data science pipeline for processing text data. We will go through the various steps involved in the Natural Language Processing (NLP) pipeline for topic modelling and topic classification, including tokenization, lemmatization, and obtaining word embeddings. We will also use PyTorch to build a neural network for multi-class topic classification on this dataset.
The AG’s News Topic Classification Dataset contains news articles from four different categories, making it a nice source of text data for NLP tasks. We will guide you through the process of understanding the dataset, implementing various NLP techniques, and building a model for classification.
You can use the section headings below to jump to the tasks and assignments.
Scenario#
The AG’s News Topic Classification Dataset is a collection of over 1 million news articles from more than 2000 news sources. The dataset was created by selecting the 4 largest classes from the original corpus, resulting in 120,000 training samples and 7,600 testing samples. The dataset is provided by the academic community for research purposes in data mining, information retrieval, and other non-commercial activities. We will use it to demonstrate various NLP techniques on real data, and in the end, make 2 models with this data. The files train.csv and test.csv contain all the training and testing samples as comma-separated values with 3 columns: class index, title, and description. Download train.csv and test.csv for the following tasks.
Import Packages#
We put all the packages that are needed for this tutorial below:
import os
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import spacy
import torch
import torch.nn as nn
import torch.optim as optim
from gensim.models import Word2Vec
from nltk.corpus import stopwords, wordnet
from nltk.stem import SnowballStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score, confusion_matrix
from tqdm.notebook import tqdm
from xml.sax import saxutils as su
# Add tqdm functions to pandas.
tqdm.pandas()
Task Answers#
The code block below contains the answers for the assignments in this tutorial. Do not check the answers in the next cell before attempting the tasks yourself.
def check_answer_df(df_result, df_answer, n=1):
"""
This function checks if two output dataframes are the same.
Parameters
----------
df_result : pandas.DataFrame
The result from the output of a function.
df_answer: pandas.DataFrame
The expected output of the function.
n : int
The numbering of the test case.
"""
try:
assert df_answer.equals(df_result)
print(f"Test case {n} passed.")
except Exception:
print(f"Test case {n} failed.")
print("Your output is:")
display(df_result)
print("Expected output is:")
display(df_answer)
def check_answer_np(arr_result, arr_answer, n=1):
"""
This function checks if two output arrays are the same.
Parameters
----------
arr_result : numpy.ndarray
The result from the output of a function.
arr_answer : numpy.ndarray
The expected output of the function.
n : int
The numbering of the test case.
"""
try:
assert np.array_equal(arr_result, arr_answer)
print(f"Test case {n} passed.")
except Exception:
print(f"Test case {n} failed.")
print("Your output is:")
print(arr_result)
print("Expected output is:")
print(arr_answer)
def answer_tokenize_and_lemmatize(df):
"""
Tokenize and lemmatize the text in the dataset.
Parameters
----------
df : pandas.DataFrame
The dataframe containing at least the text column.
Returns
-------
pandas.DataFrame
The dataframe with the added tokens column.
"""
# Copy the dataframe to avoid editing the original one.
df = df.copy(deep=True)
# Apply the tokenizer to create the tokens column.
df['tokens'] = df['text'].progress_apply(word_tokenize)
# Apply the lemmatizer on every word in the tokens list.
df['tokens'] = df['tokens'].progress_apply(lambda tokens: [lemmatizer.lemmatize(token, wordnet_pos(tag))
for token, tag in nltk.pos_tag(tokens)])
return df
def answer_most_used_words(df, token_col='tokens'):
"""
Generate a dataframe with the 5 most used words per class, and their count.
Parameters
----------
df : pandas.DataFrame
The dataframe containing at least the class and tokens columns.
Returns
-------
pandas.DataFrame
The dataframe with 5 rows per class, and an added 'count' column.
The dataframe is sorted in ascending order on the class and in descending order on the count.
"""
# Copy the dataframe to avoid editing the original one.
df = df.copy(deep=True)
# Filter out non-words
df[token_col] = df[token_col].apply(lambda tokens: [token for token in tokens if token.isalpha()])
# Explode the tokens so that every token gets its own row.
df = df.explode(token_col)
# Option 1: groupby on class and token, get the size of how many rows per item,
# add that as a column.
counts = df.groupby(['class', token_col]).size().reset_index(name='count')
# Option 2: make a pivot table based on the class and token based on how many
# rows per combination there are, add counts as a column.
# counts = counts.pivot_table(index=['class', 'tokens'], aggfunc='size').reset_index(name='count')
# Sort the values on the class and count, get only the first 5 rows per class.
counts = counts.sort_values(['class', 'count'], ascending=[True, False]).groupby('class').head()
return counts
def answer_remove_stopwords(df):
"""
Remove stopwords from the tokens.
Parameters
----------
df : pandas.DataFrame
The dataframe containing at least the tokens column,
where the value in each row is a list of tokens.
Returns
-------
pandas.DataFrame
The dataframe with stopwords removed from the tokens column.
"""
# Copy the dataframe to avoid editing the original one.
df = df.copy(deep=True)
# Using a set for quicker lookups.
stopwords_set = set(stopwords_list)
# Filter stopwords from tokens.
df['tokens'] = df['tokens'].apply(lambda tokens: [token for token in tokens
if token.lower() not in stopwords_set])
return df
def answer_spacy_tokens(df):
"""
Add a column with a list of lemmatized tokens, without stopwords.
Parameters
----------
df : pandas.DataFrame
The dataframe containing at least the doc column.
Returns
-------
pandas.DataFrame
The dataframe with the spacy_tokens column.
"""
# Copy the dataframe to avoid editing the original one.
df = df.copy(deep=True)
df['spacy_tokens'] = df['doc'].apply(lambda tokens: [token.lemma_ for token in tokens
if not token.is_stop])
return df
def answer_largest_proportion(arr):
"""
For every row, get the column number where it has the largest value.
Parameters
----------
arr : numpy.array
The array with the amount of topics as the amount of columns
and the amount of documents as the number of rows.
Every row should sum up to 1.
Returns
-------
numpy.array
The 1-dimensional array containing the label of the topic
the document has the largest proportion in.
"""
return np.argmax(arr, axis=1)
def answer_add_padded_tensors(df1, df2):
"""
Add a tensor column to the dataframes, with every tensor having the same dimensions.
Parameters
----------
df1 : pandas.DataFrame
The first dataframe containing at least the tokens or doc column.
df2 : pandas.DataFrame
The second dataframe containing at least the tokens or doc column.
Returns
-------
tuple[pandas.DataFrame]
The dataframes with the added tensor column.
"""
# Copy the dataframes to avoid editing the originals.
df1 = df1.copy(deep=True)
df2 = df2.copy(deep=True)
# Sample 10% from both datasets.
df1 = df1.sample(frac=0.1, random_state=42)
df2 = df2.sample(frac=0.1, random_state=42)
# Add tensors (option 1: our own model).
for df in [df1, df2]:
df['tensor'] = df['tokens'].apply(lambda tokens: np.vstack([w2v_model.wv[token]
for token in tokens]))
# Add tensors (option 2: spaCy tensors).
for df in [df1, df2]:
df['tensor'] = df['doc'].apply(lambda doc: doc.tensor)
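# Note: as written, both option loops above run, so the spaCy tensors from option 2
# overwrite the Word2Vec tensors from option 1. Keep only the option you want to use.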
# Determine the largest number of rows (word vectors) among all tensors.
largest = max(df1['tensor'].apply(lambda x: x.shape[0]).max(),
df2['tensor'].apply(lambda x: x.shape[0]).max())
# Pad our tensors to that amount.
for df in [df1, df2]:
df['tensor'] = df['tensor'].apply(lambda x: np.pad(x, ((0, largest - x.shape[0]), (0, 0))))
return df1, df2
# Confusion matrix code
# # Compute the confusion matrix
# cm = confusion_matrix(test_labels, test_pred)
# # Plot the confusion matrix using seaborn
# sns.heatmap(cm, annot=True, cmap='Blues', fmt='g', xticklabels=labels, yticklabels=labels)
Task 3: Preprocess Text Data#
In this task, we will preprocess the text data from the AG News Dataset. First, we need to load the files.
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
display(df_train, df_test)
Class Index | Title | Description | |
---|---|---|---|
0 | 3 | Wall St. Bears Claw Back Into the Black (Reuters) | Reuters - Short-sellers, Wall Street's dwindli... |
1 | 3 | Carlyle Looks Toward Commercial Aerospace (Reu... | Reuters - Private investment firm Carlyle Grou... |
2 | 3 | Oil and Economy Cloud Stocks' Outlook (Reuters) | Reuters - Soaring crude prices plus worries\ab... |
3 | 3 | Iraq Halts Oil Exports from Main Southern Pipe... | Reuters - Authorities have halted oil export\f... |
4 | 3 | Oil prices soar to all-time record, posing new... | AFP - Tearaway world oil prices, toppling reco... |
... | ... | ... | ... |
119995 | 1 | Pakistan's Musharraf Says Won't Quit as Army C... | KARACHI (Reuters) - Pakistani President Perve... |
119996 | 2 | Renteria signing a top-shelf deal | Red Sox general manager Theo Epstein acknowled... |
119997 | 2 | Saban not going to Dolphins yet | The Miami Dolphins will put their courtship of... |
119998 | 2 | Today's NFL games | PITTSBURGH at NY GIANTS Time: 1:30 p.m. Line: ... |
119999 | 2 | Nets get Carter from Raptors | INDIANAPOLIS -- All-Star Vince Carter was trad... |
120000 rows × 3 columns
Class Index | Title | Description | |
---|---|---|---|
0 | 3 | Fears for T N pension after talks | Unions representing workers at Turner Newall... |
1 | 4 | The Race is On: Second Private Team Sets Launc... | SPACE.com - TORONTO, Canada -- A second\team o... |
2 | 4 | Ky. Company Wins Grant to Study Peptides (AP) | AP - A company founded by a chemistry research... |
3 | 4 | Prediction Unit Helps Forecast Wildfires (AP) | AP - It's barely dawn when Mike Fitzpatrick st... |
4 | 4 | Calif. Aims to Limit Farm-Related Smog (AP) | AP - Southern California's smog-fighting agenc... |
... | ... | ... | ... |
7595 | 1 | Around the world | Ukrainian presidential candidate Viktor Yushch... |
7596 | 2 | Void is filled with Clement | With the supply of attractive pitching options... |
7597 | 2 | Martinez leaves bitter | Like Roger Clemens did almost exactly eight ye... |
7598 | 3 | 5 of arthritis patients in Singapore take Bext... | SINGAPORE : Doctors in the United States have ... |
7599 | 3 | EBay gets into rentals | EBay plans to buy the apartment and home renta... |
7600 rows × 3 columns
As shown below, the classes are distributed evenly over the training and test data.
display(df_train['Class Index'].value_counts(), df_test['Class Index'].value_counts())
3 30000
4 30000
2 30000
1 30000
Name: Class Index, dtype: int64
3 1900
4 1900
2 1900
1 1900
Name: Class Index, dtype: int64
To make the data easier to understand, we will add a class column, derived from the original Class Index column, containing the category of the news article. To process both the title and the news text together, we will combine the Title and Description columns into one text column. We will work with just the train data until the point where we need the test data again.
def reformat_data(df):
"""
Reformat the Class Index column to a Class column and combine
the Title and Description columns into a Text column.
Select only the class_idx, class and text columns afterwards.
Parameters
----------
df : pandas.DataFrame
The original dataframe.
Returns
-------
pandas.DataFrame
The reformatted dataframe.
"""
# Make the class column using a dictionary.
df = df.rename(columns={'Class Index': 'class_idx'})
classes = {1: 'World', 2: 'Sports', 3: 'Business', 4: 'Sci/Tech'}
df['class'] = df['class_idx'].apply(classes.get)
# Use string concatenation for the text column and unescape HTML characters.
df['text'] = (df['Title'] + ' ' + df['Description']).apply(su.unescape)
# Select only the class_idx, class, and text column.
df = df[['class_idx', 'class', 'text']]
return df
df_train = reformat_data(df_train)
display(df_train)
class_idx | class | text | |
---|---|---|---|
0 | 3 | Business | Wall St. Bears Claw Back Into the Black (Reute... |
1 | 3 | Business | Carlyle Looks Toward Commercial Aerospace (Reu... |
2 | 3 | Business | Oil and Economy Cloud Stocks' Outlook (Reuters... |
3 | 3 | Business | Iraq Halts Oil Exports from Main Southern Pipe... |
4 | 3 | Business | Oil prices soar to all-time record, posing new... |
... | ... | ... | ... |
119995 | 1 | World | Pakistan's Musharraf Says Won't Quit as Army C... |
119996 | 2 | Sports | Renteria signing a top-shelf deal Red Sox gene... |
119997 | 2 | Sports | Saban not going to Dolphins yet The Miami Dolp... |
119998 | 2 | Sports | Today's NFL games PITTSBURGH at NY GIANTS Time... |
119999 | 2 | Sports | Nets get Carter from Raptors INDIANAPOLIS -- A... |
120000 rows × 3 columns
Tokenization#
Tokenization is the process of breaking down a text into individual tokens, which are usually words but can also be phrases or sentences. It helps language models to understand and analyze text data by breaking it down into smaller, more manageable pieces. While it may seem like a trivial task, tokenization can be applied in multiple ways and thus be a complex and challenging task influencing NLP applications.
For example, in languages like English, it is generally straightforward to identify words by using spaces as delimiters. However, there are exceptions, such as contractions like “can’t” and hyphenated words like “self-driving”. In Dutch, where multiple nouns can be combined into one larger compound noun without any delimiter, tokenization can also be hard: how would you tokenize “hippopotomonstrosesquippedaliofobie”? In other languages, such as Chinese and Japanese, there are no spaces between words, so identifying word boundaries is much more difficult.
To illustrate the use of tokenization, let’s consider the following example, which tokenizes a sample text using the word_tokenize function from the NLTK package. That function uses a pre-trained tokenization model for English.
# Sample text.
text = "The quick brown fox jumped over the lazy dog. The cats couldn't wait to sleep all day."
# Tokenize the text.
tokens = word_tokenize(text)
# Print the text and the tokens.
print("Original text:", text)
print("Tokenized text:", tokens)
Original text: The quick brown fox jumped over the lazy dog. The cats couldn't wait to sleep all day.
Tokenized text: ['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', '.', 'The', 'cats', 'could', "n't", 'wait', 'to', 'sleep', 'all', 'day', '.']
Part-of-speech tagging#
Part-of-speech (POS) tagging is the process of assigning each word in a text corpus with a specific part-of-speech tag based on its context and definition. The tags typically include nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, interjections, and more. POS tagging can help other NLP tasks disambiguate a token somewhat due to the added context.
pos_tags = nltk.pos_tag(tokens)
print(pos_tags)
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumped', 'VBD'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.'), ('The', 'DT'), ('cats', 'NNS'), ('could', 'MD'), ("n't", 'RB'), ('wait', 'VB'), ('to', 'TO'), ('sleep', 'VB'), ('all', 'DT'), ('day', 'NN'), ('.', '.')]
Stemming / Lemmatization#
Stemming and lemmatization are two common techniques used in NLP to preprocess and normalize text data. Both techniques involve transforming words into their root form, but they differ in their approach and the level of normalization they provide.
Stemming is a technique that involves reducing words to their base or stem form by removing any affixes or suffixes. For example, the stem of the word “lazily” would be “lazi”. Stemming is a simple and fast technique that can be useful. However, it can also produce inaccurate or incorrect results since it does not consider the context or part of speech of the word.
Lemmatization, on the other hand, is a more sophisticated technique that involves identifying the base or dictionary form of a word, also known as the lemma. Unlike stemming, lemmatization can consider the context and part of speech of the word, which can make it more accurate and reliable. With lemmatization, the lemma of the word “lazily” would be “lazy”. Lemmatization can be slower and more complex than stemming but provides a higher level of normalization.
# Initialize the stemmer and lemmatizer.
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()
def wordnet_pos(nltk_pos):
"""
Function to map POS tags to wordnet tags for lemmatizer.
"""
if nltk_pos.startswith('V'):
return wordnet.VERB
elif nltk_pos.startswith('J'):
return wordnet.ADJ
elif nltk_pos.startswith('R'):
return wordnet.ADV
return wordnet.NOUN
# Perform stemming and lemmatization separately on the tokens.
stemmed_tokens = [stemmer.stem(token) for token in tokens]
lemmatized_tokens = [lemmatizer.lemmatize(token, wordnet_pos(tag))
for token, tag in nltk.pos_tag(tokens)]
# Print the results.
print("Stemmed text:", stemmed_tokens)
print("Lemmatized text:", lemmatized_tokens)
Stemmed text: ['the', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazi', 'dog', '.', 'the', 'cat', 'could', "n't", 'wait', 'to', 'sleep', 'all', 'day', '.']
Lemmatized text: ['The', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog', '.', 'The', 'cat', 'could', "n't", 'wait', 'to', 'sleep', 'all', 'day', '.']
Stopword removal#
Stopword removal is a common technique used in NLP to preprocess and clean text data by removing words that are considered to be of little or no value in terms of conveying meaning or information. These words are called “stopwords” and they include common words such as “the”, “a”, “an”, “and”, “or”, “but”, and so on.
The purpose of stopword removal in NLP is to improve the accuracy and efficiency of text analysis and processing by reducing the noise and complexity of the data. Stopwords are often used to form grammatical structures in a sentence, but they do not carry much meaning or relevance to the main topic or theme of the text. So by removing these words, we can reduce the dimensionality of the text data, improve the performance of machine learning models, and speed up the processing of text data. NLTK has a predefined list of stopwords for English.
# English stopwords in NLTK.
stopwords_list = stopwords.words('english')
print(stopwords_list)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
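As a quick illustration (separate from the assignment below), this is how the stopwords could be filtered out of the sample tokens from earlier, using a set for fast membership checks:
# Build a set for fast lookups, then keep only the tokens that are not stopwords.
stopwords_set = set(stopwords_list)
filtered_tokens = [token for token in tokens if token.lower() not in stopwords_set]
print(filtered_tokens)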
Assignment for Task 3#
Your task (which is your assignment) is to write functions to do the following:
Since we want to use our text to make a model later on, we need to preprocess it. Add a tokens column to the df_train dataframe with the text tokenized, then lemmatize those tokens. You must use the POS tags when lemmatizing.

Hint: Use the pandas.Series.apply function with the imported nltk.tokenize.word_tokenize function. Recall that you can use the pd.Series.apply? syntax in a code cell for more information.

Hint: use the nltk.stem.WordNetLemmatizer.lemmatize function to lemmatize a token. Use the wordnet_pos function to obtain the POS tag for the lemmatizer.

Tokenizing and lemmatizing the entire dataset can take a while too. We use tqdm and the pandas.Series.progress_apply function in the answer version to show progress bars for the operations.

Our goal is to have a dataframe that looks like the following:
# This part of code will take several minutes to run.
answer_df = answer_tokenize_and_lemmatize(df_train)
display(answer_df)
class_idx | class | text | tokens | |
---|---|---|---|---|
0 | 3 | Business | Wall St. Bears Claw Back Into the Black (Reute... | [Wall, St., Bears, Claw, Back, Into, the, Blac... |
1 | 3 | Business | Carlyle Looks Toward Commercial Aerospace (Reu... | [Carlyle, Looks, Toward, Commercial, Aerospace... |
2 | 3 | Business | Oil and Economy Cloud Stocks' Outlook (Reuters... | [Oil, and, Economy, Cloud, Stocks, ', Outlook,... |
3 | 3 | Business | Iraq Halts Oil Exports from Main Southern Pipe... | [Iraq, Halts, Oil, Exports, from, Main, Southe... |
4 | 3 | Business | Oil prices soar to all-time record, posing new... | [Oil, price, soar, to, all-time, record, ,, po... |
... | ... | ... | ... | ... |
119995 | 1 | World | Pakistan's Musharraf Says Won't Quit as Army C... | [Pakistan, 's, Musharraf, Says, Wo, n't, Quit,... |
119996 | 2 | Sports | Renteria signing a top-shelf deal Red Sox gene... | [Renteria, sign, a, top-shelf, deal, Red, Sox,... |
119997 | 2 | Sports | Saban not going to Dolphins yet The Miami Dolp... | [Saban, not, go, to, Dolphins, yet, The, Miami... |
119998 | 2 | Sports | Today's NFL games PITTSBURGH at NY GIANTS Time... | [Today, 's, NFL, game, PITTSBURGH, at, NY, GIA... |
119999 | 2 | Sports | Nets get Carter from Raptors INDIANAPOLIS -- A... | [Nets, get, Carter, from, Raptors, INDIANAPOLI... |
120000 rows × 4 columns
To see what the most used words per class are, create a new, separate dataframe with the 5 most used words per class. Sort the resulting dataframe ascending on the class and descending on the count. (A small toy illustration of the hinted pattern follows these hints.)

Hint: use the pandas.Series.apply and str.isalpha() functions to filter out non-alphabetical tokens.

Hint: use pandas.DataFrame.explode to create one row per class and token.

Hint: use pandas.DataFrame.groupby with .size() afterwards, or pandas.DataFrame.pivot_table with size as the aggfunc, to obtain the occurrences per class.

Hint: use the pandas.Series.reset_index function to obtain a dataframe with [class, tokens, count] as the columns.

Hint: use the pandas.DataFrame.sort_values function for sorting a dataframe.

Hint: use the pandas.DataFrame.groupby and pandas.DataFrame.head functions to get the first 5 rows per class.
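To make the explode and groupby hints concrete, here is a toy illustration on a tiny made-up dataframe (not the news data):
# Toy example: count how often each token occurs per class.
toy = pd.DataFrame({'class': ['A', 'A', 'B'],
                    'tokens': [['x', 'y', 'x'], ['y'], ['x']]})
# One row per (class, token) combination.
toy = toy.explode('tokens')
# Count the rows per combination and turn the result back into a dataframe.
toy_counts = toy.groupby(['class', 'tokens']).size().reset_index(name='count')
print(toy_counts)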
Our goal is to have a dataframe that looks like the following:
display(answer_most_used_words(answer_df))
class | tokens | count | |
---|---|---|---|
28111 | Business | the | 37998 |
17122 | Business | a | 30841 |
28259 | Business | to | 29384 |
24422 | Business | of | 22539 |
22506 | Business | in | 21446 |
63963 | Sci/Tech | the | 40767 |
64137 | Sci/Tech | to | 30497 |
49851 | Sci/Tech | a | 27686 |
59243 | Sci/Tech | of | 26048 |
50993 | Sci/Tech | be | 19238 |
97908 | Sports | the | 56416 |
86396 | Sports | a | 29398 |
98078 | Sports | to | 27171 |
92067 | Sports | in | 23187 |
93981 | Sports | of | 20119 |
130423 | World | the | 42548 |
117945 | World | a | 32397 |
124106 | World | in | 31243 |
130599 | World | to | 30663 |
126243 | World | of | 28728 |
Remove the stopwords from the tokens column in the df_train dataframe. Then, check with the most_used_words function: do the most used words say something about the class now?

Hint: once again, you can use the pandas.Series.apply function.
The top 5 words per class should look like this after removing stopwords:
answer_df = answer_remove_stopwords(answer_df)
display(answer_most_used_words(answer_df))
class | tokens | count | |
---|---|---|---|
26317 | Business | say | 8879 |
12756 | Business | Reuters | 6893 |
15719 | Business | US | 5609 |
18895 | Business | company | 5062 |
25086 | Business | price | 4611 |
61365 | Sci/Tech | say | 5536 |
58378 | Sci/Tech | new | 5149 |
40328 | Sci/Tech | Microsoft | 5041 |
51846 | Sci/Tech | company | 3781 |
29300 | Sci/Tech | AP | 3682 |
65192 | Sports | AP | 6245 |
98155 | Sports | win | 5492 |
90163 | Sports | game | 3819 |
89768 | Sports | first | 3696 |
96792 | Sports | team | 3571 |
127374 | World | say | 10299 |
106496 | World | Iraq | 5760 |
98469 | World | AP | 5757 |
112273 | World | Reuters | 5406 |
123453 | World | kill | 4545 |
def tokenize_and_lemmatize(df):
"""
Tokenize and lemmatize the text in the dataset.
Parameters
----------
df : pandas.DataFrame
The dataframe containing at least the text column.
Returns
-------
pandas.DataFrame
The dataframe with the added tokens column.
"""
###################################
# Fill in your answer here
return None
###################################
def most_used_words(df, token_col='tokens'):
"""
Generate a dataframe with the 5 most used words per class, and their count.
Parameters
----------
df : pandas.DataFrame
The dataframe containing at least the class and tokens columns.
Returns
-------
pandas.DataFrame
The dataframe with 5 rows per class, and an added 'count' column.
The dataframe is sorted in ascending order on the class and in descending order on the count.
"""
###################################
# Fill in your answer here
return None
###################################
def remove_stopwords(df):
"""
Remove stopwords from the tokens.
Parameters
----------
df : pandas.DataFrame
The dataframe containing at least the tokens column,
where the value in each row is a list of tokens.
Returns
-------
pandas.DataFrame
The dataframe with stopwords removed from the tokens column.
"""
###################################
# Fill in your answer here
return None
###################################
df_train = df_train # Edit this.
The code below tests if all your functions combined match the expected output.
check_answer_df(most_used_words(df_train), answer_most_used_words(answer_df))
Test case 1 failed.
Your output is:
None
Expected output is:
class | tokens | count | |
---|---|---|---|
26317 | Business | say | 8879 |
12756 | Business | Reuters | 6893 |
15719 | Business | US | 5609 |
18895 | Business | company | 5062 |
25086 | Business | price | 4611 |
61365 | Sci/Tech | say | 5536 |
58378 | Sci/Tech | new | 5149 |
40328 | Sci/Tech | Microsoft | 5041 |
51846 | Sci/Tech | company | 3781 |
29300 | Sci/Tech | AP | 3682 |
65192 | Sports | AP | 6245 |
98155 | Sports | win | 5492 |
90163 | Sports | game | 3819 |
89768 | Sports | first | 3696 |
96792 | Sports | team | 3571 |
127374 | World | say | 10299 |
106496 | World | Iraq | 5760 |
98469 | World | AP | 5757 |
112273 | World | Reuters | 5406 |
123453 | World | kill | 4545 |
Task 4: Another option: spaCy#
spaCy is another library used to perform various NLP tasks like tokenization, part-of-speech tagging, named entity recognition, dependency parsing, and much more. It provides pre-trained models for different languages and domains, which can be used as-is but also can be fine-tuned on a specific task or domain.
In an object-oriented way, spaCy can be thought of as a collection of classes and objects that work together to perform NLP tasks. Some of the important functions and classes in spaCy include:
nlp: The core function that provides the main functionality of spaCy. It is used to process text and create a Doc object.

Doc: A container for accessing linguistic annotations like tokens, part-of-speech tags, named entities, and dependency parse information. It is created by the nlp function and represents a processed document.

Token: An object representing a single token in a Doc object. It contains information like the token text, part-of-speech tag, lemma, embedding, and much more.
When a text is processed by spaCy, it is first passed to the nlp function, which uses the loaded model to tokenize the text and applies various linguistic annotations like part-of-speech tagging, named entity recognition, and dependency parsing in the background. The resulting annotations are stored in a Doc object, which can be accessed and manipulated using various methods and attributes. For example, the Doc object can be iterated over to access each Token object in the document.
# Load the small English model in spaCy.
# Disable Named Entity Recognition and the parser in the model pipeline since we're not using them.
# Check the following website for the spaCy NLP pipeline:
# - https://spacy.io/usage/processing-pipelines
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
# Process the text using spaCy.
doc = nlp(text)
# This becomes a spaCy Doc object, which prints nicely as the original string.
print(type(doc), doc)
# We can iterate over the tokens in the Doc, since it has already been tokenized underneath.
print(type(doc[0]))
for token in doc:
print(token)
<class 'spacy.tokens.doc.Doc'> The quick brown fox jumped over the lazy dog. The cats couldn't wait to sleep all day.
<class 'spacy.tokens.token.Token'>
The
quick
brown
fox
jumped
over
the
lazy
dog
.
The
cats
could
n't
wait
to
sleep
all
day
.
Since a lot of processing has already been done, we can also directly access multiple attributes of the Token objects. For example, we can directly access the lemma of a token with Token.lemma_ and check if a token is a stop word with Token.is_stop.
print(doc[0].lemma_, type(doc[0].lemma_), doc[0].is_stop, type(doc[0].is_stop))
the <class 'str'> True <class 'bool'>
Here is the code to add a column with a Doc representation of the text column to the dataframe. Executing this cell takes several minutes, so we added a progress bar.
def add_spacy(df):
"""
Add a column with the spaCy Doc objects.
Parameters
----------
df : pandas.DataFrame
The dataframe containing at least the text column.
Returns
-------
pandas.DataFrame
The dataframe with the added doc column.
"""
# Copy the dataframe to avoid editing the original one.
df = df.copy(deep=True)
# Use all but two of the CPUs in the machine (with a minimum of one).
n_process = max(1, os.cpu_count()-2)
# Use multiple CPUs to speed up computing.
df['doc'] = [doc for doc in tqdm(nlp.pipe(df['text'], n_process=n_process), total=df.shape[0])]
return df
df_train = add_spacy(answer_df)
display(df_train)
class_idx | class | text | tokens | doc | |
---|---|---|---|---|---|
0 | 3 | Business | Wall St. Bears Claw Back Into the Black (Reute... | [Wall, St., Bears, Claw, Back, Black, (, Reute... | (Wall, St., Bears, Claw, Back, Into, the, Blac... |
1 | 3 | Business | Carlyle Looks Toward Commercial Aerospace (Reu... | [Carlyle, Looks, Toward, Commercial, Aerospace... | (Carlyle, Looks, Toward, Commercial, Aerospace... |
2 | 3 | Business | Oil and Economy Cloud Stocks' Outlook (Reuters... | [Oil, Economy, Cloud, Stocks, ', Outlook, (, R... | (Oil, and, Economy, Cloud, Stocks, ', Outlook,... |
3 | 3 | Business | Iraq Halts Oil Exports from Main Southern Pipe... | [Iraq, Halts, Oil, Exports, Main, Southern, Pi... | (Iraq, Halts, Oil, Exports, from, Main, Southe... |
4 | 3 | Business | Oil prices soar to all-time record, posing new... | [Oil, price, soar, all-time, record, ,, pose, ... | (Oil, prices, soar, to, all, -, time, record, ... |
... | ... | ... | ... | ... | ... |
119995 | 1 | World | Pakistan's Musharraf Says Won't Quit as Army C... | [Pakistan, 's, Musharraf, Says, Wo, n't, Quit,... | (Pakistan, 's, Musharraf, Says, Wo, n't, Quit,... |
119996 | 2 | Sports | Renteria signing a top-shelf deal Red Sox gene... | [Renteria, sign, top-shelf, deal, Red, Sox, ge... | (Renteria, signing, a, top, -, shelf, deal, Re... |
119997 | 2 | Sports | Saban not going to Dolphins yet The Miami Dolp... | [Saban, go, Dolphins, yet, Miami, Dolphins, pu... | (Saban, not, going, to, Dolphins, yet, The, Mi... |
119998 | 2 | Sports | Today's NFL games PITTSBURGH at NY GIANTS Time... | [Today, 's, NFL, game, PITTSBURGH, NY, GIANTS,... | (Today, 's, NFL, games, PITTSBURGH, at, NY, GI... |
119999 | 2 | Sports | Nets get Carter from Raptors INDIANAPOLIS -- A... | [Nets, get, Carter, Raptors, INDIANAPOLIS, --,... | (Nets, get, Carter, from, Raptors, INDIANAPOLI... |
120000 rows × 5 columns
Assignment for Task 4#
Your task (which is your assignment) is to write a function to do the following:
Add a spacy_tokens column to the df_train dataframe containing a list of lemmatized tokens as strings, with the stopwords removed.

Hint: use a pandas.Series.apply operation on the doc column to accomplish this.
Our goal is to have a dataframe that looks like the following:
answer_df = answer_spacy_tokens(df_train)
display(answer_df)
class_idx | class | text | tokens | doc | spacy_tokens | |
---|---|---|---|---|---|---|
0 | 3 | Business | Wall St. Bears Claw Back Into the Black (Reute... | [Wall, St., Bears, Claw, Back, Black, (, Reute... | (Wall, St., Bears, Claw, Back, Into, the, Blac... | [Wall, St., Bears, Claw, Black, (, Reuters, ),... |
1 | 3 | Business | Carlyle Looks Toward Commercial Aerospace (Reu... | [Carlyle, Looks, Toward, Commercial, Aerospace... | (Carlyle, Looks, Toward, Commercial, Aerospace... | [Carlyle, look, Commercial, Aerospace, (, Reut... |
2 | 3 | Business | Oil and Economy Cloud Stocks' Outlook (Reuters... | [Oil, Economy, Cloud, Stocks, ', Outlook, (, R... | (Oil, and, Economy, Cloud, Stocks, ', Outlook,... | [oil, Economy, Cloud, Stocks, ', Outlook, (, R... |
3 | 3 | Business | Iraq Halts Oil Exports from Main Southern Pipe... | [Iraq, Halts, Oil, Exports, Main, Southern, Pi... | (Iraq, Halts, Oil, Exports, from, Main, Southe... | [Iraq, Halts, Oil, Exports, Main, Southern, Pi... |
4 | 3 | Business | Oil prices soar to all-time record, posing new... | [Oil, price, soar, all-time, record, ,, pose, ... | (Oil, prices, soar, to, all, -, time, record, ... | [oil, price, soar, -, time, record, ,, pose, n... |
... | ... | ... | ... | ... | ... | ... |
119995 | 1 | World | Pakistan's Musharraf Says Won't Quit as Army C... | [Pakistan, 's, Musharraf, Says, Wo, n't, Quit,... | (Pakistan, 's, Musharraf, Says, Wo, n't, Quit,... | [Pakistan, Musharraf, say, will, quit, Army, C... |
119996 | 2 | Sports | Renteria signing a top-shelf deal Red Sox gene... | [Renteria, sign, top-shelf, deal, Red, Sox, ge... | (Renteria, signing, a, top, -, shelf, deal, Re... | [Renteria, sign, -, shelf, deal, Red, Sox, gen... |
119997 | 2 | Sports | Saban not going to Dolphins yet The Miami Dolp... | [Saban, go, Dolphins, yet, Miami, Dolphins, pu... | (Saban, not, going, to, Dolphins, yet, The, Mi... | [Saban, go, Dolphins, Miami, Dolphins, courtsh... |
119998 | 2 | Sports | Today's NFL games PITTSBURGH at NY GIANTS Time... | [Today, 's, NFL, game, PITTSBURGH, NY, GIANTS,... | (Today, 's, NFL, games, PITTSBURGH, at, NY, GI... | [today, NFL, game, PITTSBURGH, NY, giant, Time... |
119999 | 2 | Sports | Nets get Carter from Raptors INDIANAPOLIS -- A... | [Nets, get, Carter, Raptors, INDIANAPOLIS, --,... | (Nets, get, Carter, from, Raptors, INDIANAPOLI... | [net, Carter, Raptors, INDIANAPOLIS, --, -, St... |
120000 rows × 6 columns
def spacy_tokens(df):
"""
Add a column with a list of lemmatized tokens, without stopwords.
Parameters
----------
df : pandas.DataFrame
The dataframe containing at least the doc column.
Returns
-------
pandas.DataFrame
The dataframe with the spacy_tokens column.
"""
###################################
# Fill in your answer here
return None
###################################
df_train = df_train # Edit this.
The code below tests if the function matches the expected output.
check_answer_df(df_train, answer_df)
Test case 1 failed.
Your output is:
class_idx | class | text | tokens | doc | |
---|---|---|---|---|---|
0 | 3 | Business | Wall St. Bears Claw Back Into the Black (Reute... | [Wall, St., Bears, Claw, Back, Black, (, Reute... | (Wall, St., Bears, Claw, Back, Into, the, Blac... |
1 | 3 | Business | Carlyle Looks Toward Commercial Aerospace (Reu... | [Carlyle, Looks, Toward, Commercial, Aerospace... | (Carlyle, Looks, Toward, Commercial, Aerospace... |
2 | 3 | Business | Oil and Economy Cloud Stocks' Outlook (Reuters... | [Oil, Economy, Cloud, Stocks, ', Outlook, (, R... | (Oil, and, Economy, Cloud, Stocks, ', Outlook,... |
3 | 3 | Business | Iraq Halts Oil Exports from Main Southern Pipe... | [Iraq, Halts, Oil, Exports, Main, Southern, Pi... | (Iraq, Halts, Oil, Exports, from, Main, Southe... |
4 | 3 | Business | Oil prices soar to all-time record, posing new... | [Oil, price, soar, all-time, record, ,, pose, ... | (Oil, prices, soar, to, all, -, time, record, ... |
... | ... | ... | ... | ... | ... |
119995 | 1 | World | Pakistan's Musharraf Says Won't Quit as Army C... | [Pakistan, 's, Musharraf, Says, Wo, n't, Quit,... | (Pakistan, 's, Musharraf, Says, Wo, n't, Quit,... |
119996 | 2 | Sports | Renteria signing a top-shelf deal Red Sox gene... | [Renteria, sign, top-shelf, deal, Red, Sox, ge... | (Renteria, signing, a, top, -, shelf, deal, Re... |
119997 | 2 | Sports | Saban not going to Dolphins yet The Miami Dolp... | [Saban, go, Dolphins, yet, Miami, Dolphins, pu... | (Saban, not, going, to, Dolphins, yet, The, Mi... |
119998 | 2 | Sports | Today's NFL games PITTSBURGH at NY GIANTS Time... | [Today, 's, NFL, game, PITTSBURGH, NY, GIANTS,... | (Today, 's, NFL, games, PITTSBURGH, at, NY, GI... |
119999 | 2 | Sports | Nets get Carter from Raptors INDIANAPOLIS -- A... | [Nets, get, Carter, Raptors, INDIANAPOLIS, --,... | (Nets, get, Carter, from, Raptors, INDIANAPOLI... |
120000 rows × 5 columns
Expected output is:
class_idx | class | text | tokens | doc | spacy_tokens | |
---|---|---|---|---|---|---|
0 | 3 | Business | Wall St. Bears Claw Back Into the Black (Reute... | [Wall, St., Bears, Claw, Back, Black, (, Reute... | (Wall, St., Bears, Claw, Back, Into, the, Blac... | [Wall, St., Bears, Claw, Black, (, Reuters, ),... |
1 | 3 | Business | Carlyle Looks Toward Commercial Aerospace (Reu... | [Carlyle, Looks, Toward, Commercial, Aerospace... | (Carlyle, Looks, Toward, Commercial, Aerospace... | [Carlyle, look, Commercial, Aerospace, (, Reut... |
2 | 3 | Business | Oil and Economy Cloud Stocks' Outlook (Reuters... | [Oil, Economy, Cloud, Stocks, ', Outlook, (, R... | (Oil, and, Economy, Cloud, Stocks, ', Outlook,... | [oil, Economy, Cloud, Stocks, ', Outlook, (, R... |
3 | 3 | Business | Iraq Halts Oil Exports from Main Southern Pipe... | [Iraq, Halts, Oil, Exports, Main, Southern, Pi... | (Iraq, Halts, Oil, Exports, from, Main, Southe... | [Iraq, Halts, Oil, Exports, Main, Southern, Pi... |
4 | 3 | Business | Oil prices soar to all-time record, posing new... | [Oil, price, soar, all-time, record, ,, pose, ... | (Oil, prices, soar, to, all, -, time, record, ... | [oil, price, soar, -, time, record, ,, pose, n... |
... | ... | ... | ... | ... | ... | ... |
119995 | 1 | World | Pakistan's Musharraf Says Won't Quit as Army C... | [Pakistan, 's, Musharraf, Says, Wo, n't, Quit,... | (Pakistan, 's, Musharraf, Says, Wo, n't, Quit,... | [Pakistan, Musharraf, say, will, quit, Army, C... |
119996 | 2 | Sports | Renteria signing a top-shelf deal Red Sox gene... | [Renteria, sign, top-shelf, deal, Red, Sox, ge... | (Renteria, signing, a, top, -, shelf, deal, Re... | [Renteria, sign, -, shelf, deal, Red, Sox, gen... |
119997 | 2 | Sports | Saban not going to Dolphins yet The Miami Dolp... | [Saban, go, Dolphins, yet, Miami, Dolphins, pu... | (Saban, not, going, to, Dolphins, yet, The, Mi... | [Saban, go, Dolphins, Miami, Dolphins, courtsh... |
119998 | 2 | Sports | Today's NFL games PITTSBURGH at NY GIANTS Time... | [Today, 's, NFL, game, PITTSBURGH, NY, GIANTS,... | (Today, 's, NFL, games, PITTSBURGH, at, NY, GI... | [today, NFL, game, PITTSBURGH, NY, giant, Time... |
119999 | 2 | Sports | Nets get Carter from Raptors INDIANAPOLIS -- A... | [Nets, get, Carter, Raptors, INDIANAPOLIS, --,... | (Nets, get, Carter, from, Raptors, INDIANAPOLI... | [net, Carter, Raptors, INDIANAPOLIS, --, -, St... |
120000 rows × 6 columns
We use the answer version of the most_used_words function to again display the top 5 words per class in the dataset. Do you see some differences between the lemmatized tokens obtained from NLTK and spaCy?
display(answer_most_used_words(answer_df, 'spacy_tokens'))
class | spacy_tokens | count | |
---|---|---|---|
24965 | Business | say | 8927 |
11332 | Business | Reuters | 6885 |
22746 | Business | oil | 5402 |
17106 | Business | company | 5107 |
23716 | Business | price | 4951 |
55144 | Sci/Tech | new | 5549 |
37848 | Sci/Tech | Microsoft | 5073 |
58225 | Sci/Tech | say | 5023 |
48269 | Sci/Tech | company | 3847 |
28075 | Sci/Tech | AP | 3692 |
62256 | Sports | AP | 6262 |
94335 | Sports | win | 5995 |
85416 | Sports | game | 4459 |
92830 | Sports | team | 3738 |
91307 | Sports | season | 3685 |
121838 | World | say | 10174 |
94687 | World | AP | 5786 |
101753 | World | Iraq | 5783 |
106953 | World | Reuters | 5414 |
117760 | World | kill | 4967 |
Task 5: Unsupervised Learning - Topic Modelling#
Topic modelling is a technique used in NLP that aims to identify the underlying topics or themes in a collection of texts. One way to perform topic modelling is using the probabilistic model Latent Dirichlet Allocation (LDA).
LDA assumes that each document in a collection is a mixture of different topics, and each topic is a probability distribution over a set of words. The model then infers the underlying topic distribution for each document in the collection and the word distribution for each topic. LDA is trained using an iterative algorithm that maximizes the likelihood of observing the given documents.
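In standard LDA notation (a general formulation, not something defined elsewhere in this tutorial), the generative process can be summarized as:

$$\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad \phi_k \sim \mathrm{Dirichlet}(\beta), \qquad z_{d,n} \sim \mathrm{Categorical}(\theta_d), \qquad w_{d,n} \sim \mathrm{Categorical}(\phi_{z_{d,n}})$$

where $\theta_d$ is the topic distribution of document $d$, $\phi_k$ is the word distribution of topic $k$, $z_{d,n}$ is the topic assignment of the $n$-th word of document $d$, and $w_{d,n}$ is that word itself.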
To use LDA, we need to represent the documents as a bag of words, where the order of the words is ignored and only the frequency of each word in the document is considered. This bag-of-words representation allows us to represent each document as a vector of word frequencies, which can be used as input to the LDA algorithm. Computing LDA might take a moment on our dataset size.
# Define the number of topics to model with LDA.
num_topics = 4
# Convert preprocessed text to bag-of-words representation using CountVectorizer.
vectorizer = CountVectorizer(max_features=50000)
# fit_transform requires either strings as input or multiple extra arguments
# and functions, so we simply turn the token lists back into strings.
X = vectorizer.fit_transform(answer_df['spacy_tokens'].apply(lambda x: ' '.join(x)).values)
# Fit LDA to the feature matrix. Verbose so we know what iteration we're on.
lda = LatentDirichletAllocation(n_components=num_topics, max_iter=10, random_state=42, verbose=True)
lda.fit(X)
# Extract the topic proportions for each document.
doc_topic_proportions = lda.transform(X)
iteration: 1 of max_iter: 10
iteration: 2 of max_iter: 10
iteration: 3 of max_iter: 10
iteration: 4 of max_iter: 10
iteration: 5 of max_iter: 10
iteration: 6 of max_iter: 10
iteration: 7 of max_iter: 10
iteration: 8 of max_iter: 10
iteration: 9 of max_iter: 10
iteration: 10 of max_iter: 10
Using this function, we can take a look at the most important words per topic. Do you see some similarities with the most occurring words per class after stopword removal?
def n_top_wordlist(model, features, ntopwords=5):
"""
Get the ntopwords (default 5) most important words per LDA topic.
"""
output = {}
for topic_idx, topic in enumerate(model.components_):
output[topic_idx] = [features[i] for i in topic.argsort()[:-ntopwords - 1:-1]]
return output
# Get the words from the CountVectorizer.
tf_feature_names = vectorizer.get_feature_names_out()
display(n_top_wordlist(lda, tf_feature_names))
{0: ['39', 'win', 'ap', 'game', 'team'],
1: ['39', 'new', 'microsoft', 'say', 'company'],
2: ['reuters', 'say', '39', 'oil', 'new'],
3: ['say', '39', 'ap', 'reuters', 'president']}
Evaluation#
Adjusted Mutual Information (AMI) and Adjusted Rand Index (ARI) are two metrics used to evaluate the performance of clustering algorithms.
AMI is a measure that takes into account the possibility of two random clusterings appearing to be similar. It is calculated as the difference between the Mutual Information (MI) of two clusterings and the expected MI, divided by the average entropy of the two clusterings minus the expected MI. An AMI of around 0 indicates that the agreement between the two clusterings is no better than chance, while 1 indicates identical clusterings.
The Rand Index (RI) is a measure that counts the number of pairs of samples that are assigned to the same or different clusters in both the predicted and true clusterings. The raw RI score is then adjusted for chance into the ARI score using a scheme similar to that of AMI. For ARI a score of 0 indicates random labeling and 1 indicates perfect agreement. The ARI is bounded below by -0.5 for very large differences in labeling.
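Written out (these are the standard definitions, also used by scikit-learn):

$$\mathrm{AMI}(U, V) = \frac{\mathrm{MI}(U, V) - \mathbb{E}[\mathrm{MI}(U, V)]}{\mathrm{mean}\big(H(U), H(V)\big) - \mathbb{E}[\mathrm{MI}(U, V)]}, \qquad \mathrm{ARI} = \frac{\mathrm{RI} - \mathbb{E}[\mathrm{RI}]}{\max(\mathrm{RI}) - \mathbb{E}[\mathrm{RI}]}$$

where $U$ and $V$ are the two clusterings, $H(\cdot)$ is the entropy, and $\mathbb{E}[\cdot]$ is the expectation under random labelings.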
Assignment for Task 5#
Your task (which is your assignment) is to write a function to do the following:
The doc_topic_proportions array contains, for every document, the proportion with which that document belongs to each topic. For every document, get the topic in which it has the largest proportion. Afterwards, look at the AMI and ARI scores. Can you improve the scores by modelling more topics, using a different set of tokens, or using more iterations?

Hint: use the numpy.argmax function.
Our goal is to get an array that looks like the following:
answer_topic_most = answer_largest_proportion(doc_topic_proportions)
print(answer_topic_most, answer_topic_most.shape)
[2 1 2 ... 0 0 0] (120000,)
def largest_proportion(arr):
"""
For every row, get the column number where it has the largest value.
Parameters
----------
arr : numpy.array
The array with the amount of topics as the amount of columns
and the amount of documents as the number of rows.
Every row should sum up to 1.
Returns
-------
numpy.array
The 1-dimensional array containing the label of the topic
the document has the largest proportion in.
"""
###################################
# Fill in your answer here
return None
###################################
The code below tests if the function matches the expected output.
topic_most = largest_proportion(doc_topic_proportions)
check_answer_np(topic_most, answer_topic_most)
Test case 1 failed.
Your output is:
None
Expected output is:
[2 1 2 ... 0 0 0]
ami_score = adjusted_mutual_info_score(df_train['class'], answer_topic_most)
ari_score = adjusted_rand_score(df_train['class'], answer_topic_most)
print(f"Adjusted mutual information score: {ami_score:.2f}")
print(f"Adjusted rand score: {ari_score:.2f}")
Adjusted mutual information score: 0.52
Adjusted rand score: 0.54
Do some topics get (way) more documents assigned to them than others? Let’s take a look.
unique, counts = np.unique(answer_topic_most, return_counts=True)
print(dict(zip(unique, counts)))
{0: 30630, 1: 30442, 2: 27348, 3: 31580}
Task 6: Word Embeddings#
Word embeddings represent words as vectors in a high-dimensional space. The key idea behind word embeddings is that words with similar meanings tend to appear in similar contexts, and therefore their vector representations should be close together in this high-dimensional space. Word embeddings have been widely used in various NLP tasks such as sentiment analysis, machine translation, and information retrieval.
There are several techniques to generate word embeddings, but one of the most popular methods is the Word2Vec algorithm, which is based on a neural network architecture. Word2Vec learns embeddings by predicting the probability of a word given its context (continuous bag of words or skip-gram model). The output of the network is a set of word vectors that can be used as embeddings.
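As a minimal sketch (on toy sentences, not the news data), gensim's Word2Vec lets you pick between the two architectures with the sg parameter (0 for continuous bag of words, 1 for skip-gram):
# Tiny toy corpus: each sentence is a list of tokens.
toy_sentences = [['the', 'cat', 'sits', 'on', 'the', 'mat'],
                 ['the', 'dog', 'sits', 'on', 'the', 'rug']]
# sg=0 trains a CBOW model, sg=1 trains a skip-gram model.
toy_model = Word2Vec(toy_sentences, vector_size=16, window=2, min_count=1, sg=1)
print(toy_model.wv['cat'].shape)  # one 16-dimensional vector per word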
We can train a Word2Vec model ourselves, but keep in mind that later on we also want embeddings for the words that appear only in the test set. So let's first apply the familiar preprocessing steps to the test set:
# Reformat df_test.
df_test = reformat_data(df_test)
# NLTK preprocessing.
df_test = answer_remove_stopwords(answer_tokenize_and_lemmatize(df_test))
# spaCy preprocessing.
df_test = answer_spacy_tokens(add_spacy(df_test))
display(df_test)
class_idx | class | text | tokens | doc | spacy_tokens | |
---|---|---|---|---|---|---|
0 | 3 | Business | Fears for T N pension after talks Unions repre... | [Fears, N, pension, talk, Unions, represent, w... | (Fears, for, T, N, pension, after, talks, Unio... | [fear, t, N, pension, talk, union, represent, ... |
1 | 4 | Sci/Tech | The Race is On: Second Private Team Sets Launc... | [Race, :, Second, Private, Team, Sets, Launch,... | (The, Race, is, On, :, Second, Private, Team, ... | [Race, :, Second, Private, Team, Sets, Launch,... |
2 | 4 | Sci/Tech | Ky. Company Wins Grant to Study Peptides (AP) ... | [Ky., Company, Wins, Grant, Study, Peptides, (... | (Ky., Company, Wins, Grant, to, Study, Peptide... | [Ky., Company, Wins, Grant, Study, Peptides, (... |
3 | 4 | Sci/Tech | Prediction Unit Helps Forecast Wildfires (AP) ... | [Prediction, Unit, Helps, Forecast, Wildfires,... | (Prediction, Unit, Helps, Forecast, Wildfires,... | [Prediction, Unit, help, Forecast, Wildfires, ... |
4 | 4 | Sci/Tech | Calif. Aims to Limit Farm-Related Smog (AP) AP... | [Calif, ., Aims, Limit, Farm-Related, Smog, (,... | (Calif., Aims, to, Limit, Farm, -, Related, Sm... | [Calif., aim, Limit, Farm, -, relate, Smog, (,... |
... | ... | ... | ... | ... | ... | ... |
7595 | 1 | World | Around the world Ukrainian presidential candid... | [Around, world, Ukrainian, presidential, candi... | (Around, the, world, Ukrainian, presidential, ... | [world, ukrainian, presidential, candidate, Vi... |
7596 | 2 | Sports | Void is filled with Clement With the supply of... | [Void, fill, Clement, supply, attractive, pitc... | (Void, is, filled, with, Clement, With, the, s... | [Void, fill, Clement, supply, attractive, pitc... |
7597 | 2 | Sports | Martinez leaves bitter Like Roger Clemens did ... | [Martinez, leave, bitter, Like, Roger, Clemens... | (Martinez, leaves, bitter, Like, Roger, Clemen... | [Martinez, leave, bitter, like, Roger, Clemens... |
7598 | 3 | Business | 5 of arthritis patients in Singapore take Bext... | [5, arthritis, patient, Singapore, take, Bextr... | (5, of, arthritis, patients, in, Singapore, ta... | [5, arthritis, patient, Singapore, Bextra, Cel... |
7599 | 3 | Business | EBay gets into rentals EBay plans to buy the a... | [EBay, get, rental, EBay, plan, buy, apartment... | (EBay, gets, into, rentals, EBay, plans, to, b... | [EBay, get, rental, EBay, plan, buy, apartment... |
7600 rows × 6 columns
To obtain the complete model, we combine the tokens column into one series and call the Word2Vec function.
# Get all tokens into one series.
tokens_both = pd.concat([df_train['tokens'], df_test['tokens']])
# Train a Word2Vec model on the NLTK tokens.
w2v_model = Word2Vec(tokens_both.values, vector_size=96, min_count=1)
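Once the model is trained, it can be queried for semantically similar words. The exact neighbours depend on the training run, so this is only an illustration:
# Look up the 5 nearest neighbours of a word in the learned embedding space.
print(w2v_model.wv.most_similar('oil', topn=5))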
To obtain the embeddings, we can use the Word2Vec.wv[word] syntax. To get multiple vectors nicely next to each other in a 2D matrix, we can call numpy.vstack.
print(np.vstack([w2v_model.wv[word] for word in ['rain', 'cat', 'dog']]))
[[-2.2940760e+00 -4.8420104e-01 -4.4042167e-01 6.8525958e-01
-1.4821498e+00 2.4811094e+00 -9.9135214e-01 3.2920041e-05
4.2954880e-01 1.3281231e+00 -1.4077977e+00 -3.8684130e-01
-7.6716363e-02 -1.1165497e+00 9.3549109e-01 1.0929279e+00
2.5643912e-01 4.6144328e-01 7.0888650e-01 1.1422478e+00
-7.2016567e-01 -1.5233663e+00 -3.0344512e+00 -2.2352202e+00
5.0477499e-01 -1.3620797e-01 1.1560580e-01 -8.7251730e-02
1.3159508e+00 -3.0589628e-01 1.8796422e+00 4.5079884e-01
1.9805964e+00 7.5381422e-01 -1.4104112e+00 2.2851133e-01
2.0000076e-01 6.1606807e-01 -2.5055645e+00 1.3439629e+00
1.4786874e-01 5.2414012e-01 -7.3734067e-02 5.8897328e-01
-2.9820289e-02 -7.5076813e-01 -7.3534161e-01 -2.8183448e-01
-1.7987163e+00 1.0794261e+00 -2.5938660e-01 6.9268233e-01
1.0267434e+00 3.8907737e-01 1.7781268e+00 -1.0220027e+00
-5.6030327e-01 1.7142199e+00 -1.0593474e+00 9.3658783e-02
1.5736760e+00 1.4056033e+00 -1.0150774e+00 3.0287990e-01
1.5140053e+00 -8.6504275e-01 -1.1736679e+00 -1.6786666e+00
-1.0144037e+00 -1.3584304e+00 -9.1850632e-01 1.3619289e+00
1.2904971e+00 -1.6977031e+00 -7.3090231e-01 -3.4742847e-01
6.7464776e-02 8.9619391e-02 -1.1155491e+00 -6.6920400e-01
-1.3479675e-01 6.5459555e-01 -5.8369678e-01 -1.0921571e+00
4.1924748e-01 3.1124693e-01 -1.3012956e+00 2.1808662e-01
6.9067222e-01 1.6104497e+00 1.3644813e+00 -3.1946927e-02
-7.4468619e-01 -5.9339243e-01 4.1335011e-01 7.3781967e-01]
[-1.6707586e-01 -1.7751616e-01 -1.5991940e-01 -1.9989851e-01
-2.7424154e-01 3.6543764e-02 2.1962922e-02 -2.4741021e-01
-4.4054174e-01 -1.8425202e-01 1.1129478e-01 2.0092180e-01
-8.3667494e-02 -5.8217788e-01 -6.6241229e-01 4.6616954e-01
-3.7988979e-01 -7.2291568e-02 -3.9707631e-01 2.8176224e-01
4.9108905e-01 -2.1542142e-01 3.4716297e-02 -2.0940323e-01
2.5171158e-01 -1.9280653e-01 -5.5869442e-01 -2.5533444e-01
-2.0840354e-01 -1.1751265e-01 -4.3067787e-02 5.4453087e-01
-1.4280096e-01 1.8038911e-01 2.2600693e-01 -1.9326421e-02
2.5109005e-01 -7.1796763e-01 6.8420961e-02 4.5902830e-01
8.8544242e-02 -1.1893602e-01 6.9759995e-01 1.7631820e-01
-1.6763940e-01 -2.9361406e-01 -4.1476540e-02 -1.7099978e-02
-3.8931394e-01 4.3411884e-01 3.8996592e-02 1.6359149e-01
-1.8415020e-01 -2.8604241e-02 3.6487141e-01 -3.8500302e-02
2.6624441e-01 2.9064500e-01 -2.8262255e-01 1.2855190e-01
3.8448983e-01 -2.5975108e-01 -3.1233981e-01 1.6406569e-01
3.4801754e-01 -4.9045366e-01 1.7986092e-01 6.9124356e-02
-3.0077201e-01 -2.3007296e-01 2.5923157e-01 -1.6732115e-02
-3.8551494e-01 1.1382518e-01 -3.3340454e-01 -6.4890683e-02
5.5055684e-01 5.0011641e-01 -2.8933269e-01 -3.8424399e-02
6.6647522e-02 -4.9694654e-02 -1.7739089e-01 -3.3246987e-02
3.2536820e-01 3.2836625e-01 -1.9247569e-01 -2.1271285e-02
-1.0841837e-01 1.4557414e-01 2.8648019e-01 -5.0694276e-02
-5.6086332e-02 1.8716384e-01 5.0777632e-01 3.0855319e-01]
[-5.3797096e-01 -4.1078672e-01 -4.6644393e-01 -1.7776264e-01
-6.1859310e-01 1.3632537e-01 8.6262353e-02 -4.7705841e-01
-4.5533961e-01 -1.8748164e-02 5.8455920e-01 -4.3029517e-02
-1.8955900e-01 -7.9340595e-01 -1.1365076e+00 7.3272949e-01
-2.8275236e-01 -1.8743893e-02 -2.1410777e-01 1.3195230e-02
4.7014341e-01 -9.3369074e-02 -3.8174710e-01 -3.7527493e-01
1.2884773e-01 -2.0903893e-01 -4.8333859e-01 -3.7088567e-01
-1.7931862e-02 -6.5597802e-02 1.8665287e-01 4.3456262e-01
1.5434964e-02 2.8702241e-01 3.7848458e-01 4.7424236e-01
5.1574731e-01 -8.4790844e-01 -6.2235701e-01 5.6008190e-01
1.5852283e-01 2.0123389e-01 5.5801046e-01 6.6353047e-01
-2.4250355e-01 -5.0138640e-01 -4.9429446e-01 -1.3960998e-01
-6.4502162e-01 8.8502163e-01 -4.5185152e-01 1.0153203e+00
-2.1195376e-01 -1.9655213e-01 5.2993536e-01 -3.8079235e-01
2.0738366e-01 4.7517672e-01 -1.0797164e+00 -1.9228728e-01
3.9164144e-01 -1.4024313e-01 -3.4790218e-01 3.5092226e-01
3.5909703e-01 -1.9561702e-01 -3.6081094e-01 2.2829367e-01
-9.2727669e-02 4.3977797e-02 3.8294426e-01 2.7670828e-01
-9.1451295e-02 2.8349352e-01 8.1916869e-02 -1.2423317e-01
2.2019100e-01 2.3845011e-01 2.2413979e-01 -2.7076536e-01
-4.0909567e-01 1.2636584e-01 1.8710716e-01 6.3947570e-01
2.6947409e-01 4.5367742e-01 -7.8334993e-01 3.4023982e-02
2.5921276e-01 2.1218823e-01 4.7835699e-01 -2.5282547e-01
1.3911194e-02 -4.7073115e-02 1.1101334e+00 6.6093242e-01]]
The spaCy model we used has a Tok2Vec algorithm in its pipeline, so we can directly access the 2D matrix of all word vectors on a document with the Doc.tensor attribute. Keep in mind this still contains the embeddings of the stopwords.
print(doc.tensor)
[[ 0.30832282 -0.4600038 -0.6966157 ... -0.03586689 -0.8444165
0.33138227]
[-0.16609877 0.26174432 -0.34486908 ... 0.11861527 -0.11567482
-0.9424331 ]
[-0.1972521 -0.06649795 -0.5903488 ... -0.6296763 0.11471006
-0.27722898]
...
[-0.7060187 -0.5229768 0.8356328 ... -0.76241744 -1.0825483
-0.1288386 ]
[ 0.28227934 0.8741193 1.4112176 ... 1.3289344 0.23879066
0.7562135 ]
[-0.8391066 1.0865963 2.1023006 ... -0.42140386 0.18943703
1.1026181 ]]
Assignment for Task 6#
Your task (which is your assignment) is to write a function that does the following:
- Sample 10% of the data from both datasets; we are not going to use all the data for the neural network.
  - Hint: use the pandas.DataFrame.sample function to sample a fraction of the data. Specify a random_state value to always get the same rows from the dataframe.
- Add a tensor column to both the train and test dataframes. The column should hold one array per row, containing all the word embedding vectors as columns. You can choose whether to use vectors from our new model or the ones from spaCy.
- Determine the largest number of columns in the tensor column across both datasets.
  - Hint: use the numpy.ndarray.shape attribute to see the dimensions of an array.
  - Hint: use the pd.Series.max function to determine the largest item in a series.
- Pad all arrays in the tensor column to the size of the biggest tensor by adding columns of zeros at the end, so that all inputs for the neural network have the same size (see the small sketch after this list).
  - Hint: use the numpy.pad function to pad an array.
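As a reference point for the sampling and padding steps, here is a minimal sketch; the random_state value, the array shape, and the target width of 10 are illustrative assumptions, not part of the official answer.
# Illustrative only: sample 10% of a dataframe reproducibly (42 is an arbitrary seed).
df_small = df_train.sample(frac=0.1, random_state=42)

# Illustrative only: pad a 2D array with zero columns up to a hypothetical target width of 10.
arr = np.ones((96, 7))  # e.g. 96-dimensional vectors for 7 tokens
target_width = 10
padded = np.pad(arr, ((0, 0), (0, target_width - arr.shape[1])), mode='constant')
print(padded.shape)  # (96, 10)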
After the function, our df_train could look like the following:
answer_df_train, answer_df_test = answer_add_padded_tensors(answer_df, df_test)
display(answer_df_train)
 | class_idx | class | text | tokens | doc | spacy_tokens | tensor |
---|---|---|---|---|---|---|---|
71787 | 3 | Business | BBC set for major shake-up, claims newspaper L... | [BBC, set, major, shake-up, ,, claim, newspape... | (BBC, set, for, major, shake, -, up, ,, claims... | [BBC, set, major, shake, -, ,, claim, newspape... | [[0.15803973, 0.75629187, -0.17498428, -0.2533... |
67218 | 3 | Business | Marsh averts cash crunch Embattled insurance b... | [Marsh, avert, cash, crunch, Embattled, insura... | (Marsh, averts, cash, crunch, Embattled, insur... | [Marsh, avert, cash, crunch, embattle, insuran... | [[-0.19821277, -0.14202696, -0.73497164, -0.32... |
54066 | 2 | Sports | Jeter, Yankees Look to Take Control (AP) AP - ... | [Jeter, ,, Yankees, Look, Take, Control, (, AP... | (Jeter, ,, Yankees, Look, to, Take, Control, (... | [Jeter, ,, Yankees, look, Control, (, AP, ), A... | [[0.35961235, 0.85889757, -0.8120774, -1.26034... |
7168 | 4 | Sci/Tech | Flying the Sun to Safety When the Genesis caps... | [Flying, Sun, Safety, Genesis, capsule, come, ... | (Flying, the, Sun, to, Safety, When, the, Gene... | [fly, Sun, safety, Genesis, capsule, come, Ear... | [[-0.76188123, -0.41965377, -0.75621116, -0.21... |
29618 | 3 | Business | Stocks Seen Flat as Nortel and Oil Weigh NEW ... | [Stocks, Seen, Flat, Nortel, Oil, Weigh, NEW, ... | (Stocks, Seen, Flat, as, Nortel, and, Oil, Wei... | [stock, see, Flat, Nortel, Oil, Weigh, , NEW,... | [[-0.3134395, -0.066423565, 0.8806261, 0.31077... |
... | ... | ... | ... | ... | ... | ... | ... |
27162 | 2 | Sports | Oakland Athletics Team Report - September 14 (... | [Oakland, Athletics, Team, Report, -, Septembe... | (Oakland, Athletics, Team, Report, -, Septembe... | [Oakland, Athletics, Team, Report, -, Septembe... | [[1.7174376, -0.022900194, -0.041351043, -0.71... |
82268 | 4 | Sci/Tech | Telcos' convergence strategies diverge CANNES ... | [Telcos, ', convergence, strategy, diverge, CA... | (Telcos, ', convergence, strategies, diverge, ... | [Telcos, ', convergence, strategy, diverge, CA... | [[1.2340894, 0.39759836, 0.4969958, -0.0863048... |
7765 | 4 | Sci/Tech | Motive aims to head off system glitches Three ... | [Motive, aim, head, system, glitch, Three, new... | (Motive, aims, to, head, off, system, glitches... | [motive, aim, head, system, glitche, new, soft... | [[-0.5199754, -0.008618206, -0.70200384, -0.55... |
25871 | 3 | Business | Campbell #39;s 4th-Qtr Net Drops 20 on Higher ... | [Campbell, #, 39, ;, 4th-Qtr, Net, Drops, 20, ... | (Campbell, #, 39;s, 4th, -, Qtr, Net, Drops, 2... | [Campbell, #, 39;s, 4th, -, Qtr, Net, Drops, 2... | [[0.40166336, -0.34546575, -0.52296686, 0.4144... |
57234 | 4 | Sci/Tech | MySQL to make use of Microsoft code The code, ... | [MySQL, make, use, Microsoft, code, code, ,, o... | (MySQL, to, make, use, of, Microsoft, code, Th... | [mysql, use, Microsoft, code, code, ,, open, -... | [[-0.31237996, -0.11844462, 0.38788873, 0.2587... |
12000 rows × 7 columns
def add_padded_tensors(df1, df2):
    """
    First, sample 10% of both datasets and only use that.
    Then, add a tensor column to the dataframes, with every tensor having the same dimensions.

    Parameters
    ----------
    df1 : pandas.DataFrame
        The first dataframe containing at least the tokens or doc column.
    df2 : pandas.DataFrame
        The second dataframe containing at least the tokens or doc column.

    Returns
    -------
    tuple[pandas.DataFrame]
        The sampled dataframes with the added tensor column.
    """
    ###################################
    # Fill in your answer here
    return None
    ###################################


df_train, df_test = df_train, df_test  # Edit this.
The code below tests if the function matches the expected output.
try:
    assert (df_train['tensor'].apply(lambda x: x.shape).unique() ==
            df_test['tensor'].apply(lambda x: x.shape).unique())
    print("Test case 1 passed.")
except Exception:
    print("Test case 1 failed. Not all tensor sizes are equal.")

try:
    assert df_test.shape[0] == 760
    print("Test case 2 passed.")
except Exception:
    print("Test case 2 failed. The test dataframe does not have the correct size.")
Test case 1 failed. Not all tensor sizes are equal.
Test case 2 failed. The test dataframe does not have the correct size.
Task 7: Supervised Learning - Topic Classification#
Topic classification is a task in NLP that involves automatically assigning a given text document to one or more predefined categories or topics. This task is essential for various applications, such as document organization, search engines, sentiment analysis, and more.
In recent years, deep learning models have shown remarkable performance in various NLP tasks, including topic classification. We will explore a neural network-based approach for topic classification using the PyTorch framework. PyTorch provides an efficient way to build and train neural networks with a high degree of flexibility and ease of use.
Our neural network will take the embedding representation of the document as input and predict the corresponding topic using a softmax output layer. We will evaluate the performance of our model using various metrics such as accuracy, precision, recall, and F1-score.
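As a reminder of what these metrics look like in code, here is a tiny self-contained sketch using scikit-learn; the label values are made up purely for illustration, and the two metric functions are imported here because they are not part of the tutorial's import list above.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Made-up true and predicted class indices for four classes (illustration only).
y_true = [0, 1, 2, 3, 1]
y_pred = [0, 2, 2, 3, 1]

print(accuracy_score(y_true, y_pred))  # 0.8
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='macro')
print(precision, recall, f1)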
The following code demonstrates how to implement a neural network for topic classification in PyTorch. First, let's prepare our inputs by turning them into PyTorch tensors.
# Use dataframes from previous assignment. These use the spaCy tensors.
df_train = answer_df_train
df_test = answer_df_test
# Transform inputs into PyTorch tensors.
input_train = torch.from_numpy(np.stack(df_train['tensor']))
input_test = torch.from_numpy(np.stack(df_test['tensor']))
# Get the labels, move to 0-indexed instead of 1-indexed.
train_labels = torch.from_numpy(df_train['class_idx'].values) - 1
test_labels = torch.from_numpy(df_test['class_idx'].values) - 1
# One-hot encode labels for training.
train_target = torch.zeros((len(train_labels), 4))
train_target = train_target.scatter_(1, train_labels.unsqueeze(1), 1).unsqueeze(1)
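To see what the scatter_ call above produces, here is a tiny illustrative example with made-up class indices (not taken from the dataset):
# Made-up class indices for three documents.
labels_demo = torch.tensor([0, 2, 1])

# One row per document, one column per class; scatter_ writes a 1 at each label position.
onehot_demo = torch.zeros((3, 4)).scatter_(1, labels_demo.unsqueeze(1), 1)
print(onehot_demo)
# tensor([[1., 0., 0., 0.],
#         [0., 0., 1., 0.],
#         [0., 1., 0., 0.]])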
Then, it's time to define our network. The neural net consists of three fully connected layers (fc1, fc2, and fc3) with ReLU activation (relu) between the layers. We flatten the input tensor using view before passing it through the fully connected layers. Finally, we apply the softmax activation function (softmax) to the output tensor to obtain the predicted probabilities for each class.
class TopicClassifier(nn.Module):
    def __init__(self, input_width, input_length, output_size):
        super(TopicClassifier, self).__init__()
        self.input_width = input_width
        self.input_length = input_length
        self.output_size = output_size

        self.fc1 = nn.Linear(input_width * input_length, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, output_size)
        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        # Flatten the input tensor.
        x = x.view(-1, self.input_width * self.input_length)

        # Pass through the fully connected layers with ReLU activation.
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.relu(x)
        x = self.fc3(x)

        # Apply softmax activation to the output.
        x = self.softmax(x)
        return x
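As a quick sanity check, you could instantiate the network with made-up dimensions and confirm that the output has one probability per class; the 96-by-20 input shape below is an arbitrary example, not the actual padded size from the assignment.
# Illustrative shape check with arbitrary dimensions.
demo_model = TopicClassifier(input_width=96, input_length=20, output_size=4)
demo_input = torch.zeros(1, 96, 20)
print(demo_model(demo_input).shape)  # torch.Size([1, 4])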
Now it's time to train our network. This may take a while, but the current loss will be printed after every epoch. If you want the code to run faster, you can also put this notebook on Google Colab and use its provided GPU to speed up computing.
# Define parameters.
n_classes = len(train_labels.unique())
input_size = input_train.shape[1:]
num_epochs = 5
lr = 0.001
# Define model, loss function and optimizer.
model = TopicClassifier(*input_size, output_size=n_classes)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
# Training loop.
for epoch in range(num_epochs):
    for i, (inputs, labels) in enumerate(zip(input_train, train_target)):
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
    print(f"Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}")
Epoch [1/5], Loss: 1.3646
Epoch [2/5], Loss: 1.2608
Epoch [3/5], Loss: 1.1472
Epoch [4/5], Loss: 0.9604
Epoch [5/5], Loss: 0.8177
Assignment for Task 7 (Optional)#
Your task (which is your assignment) is to do the following:
- Use the code below to evaluate the neural network.
- Generate a confusion matrix with sklearn.metrics.confusion_matrix (it's already imported, so you can call confusion_matrix directly).
- Plot the confusion matrix using seaborn.heatmap (seaborn is usually imported as sns). Set the annot argument to True and the xticklabels and yticklabels to the labels list. One possible sketch of this step is shown after the evaluation code below.
- Also, take the time to evaluate the train set. Is there a notable difference in accuracy, precision, recall, and the F1 score between the train and test sets?
# Evaluate the neural net on the test set.
model.eval()
# Sample from the model.
with torch.no_grad():
    test_outputs = model(input_test)

# Reuse our previous function to get the label with the highest probability.
test_pred = answer_largest_proportion(test_outputs.detach())
# Set model back to training mode.
model.train()
labels = ['World', 'Sports', 'Business', 'Sci/Tech']
###################################
# Fill in your answer here
###################################
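For reference, one possible sketch of the confusion matrix step is shown below. It assumes that answer_largest_proportion returns a 1-D array of predicted class indices; it is only an illustration, not the official answer.
# Possible sketch: confusion matrix for the test set (illustration only).
cm = confusion_matrix(test_labels.numpy(), test_pred)
sns.heatmap(cm, annot=True, fmt='d', xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()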
Optional Assignment / Takeaways#
If you do not feel done with text data yet, there’s always more to do. You can still experiment with the number of epochs, learning rate, vector size, optimizer, neural network layers, regularization and so much more. Even during the preprocessing, we could have done some things differently, like making everything lowercase and removing punctuation. Be aware that every choice you make along the way trickles down into your pipeline and can have some effect on your results.
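For example, here is a minimal sketch of the lowercasing and punctuation removal mentioned above, using only the Python standard library; the sample title is taken from the dataframe shown earlier.
import string

title = "BBC set for major shake-up, claims newspaper"
cleaned = title.lower().translate(str.maketrans('', '', string.punctuation))
print(cleaned)  # bbc set for major shakeup claims newspaper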
1. Credit: this teaching material is created by Robert van Straten under the supervision of Yen-Chia Hsu.