Lecture 7: Text Data Processing (Part I)
(Last updated: Feb 18, 2026)
This lecture introduces the theory of text data processing, including preprocessing (tokenization, lemmatization, and part-of-speech tagging), word embeddings, topic modeling, sequence-to-sequence modeling, and the attention mechanism.
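As a taste of the preprocessing steps listed above, here is a minimal sketch of tokenization and stop-word removal in plain Python. The tokenizer pattern and the stop-word list are illustrative choices made for this example, not the ones used in the course materials.

```python
import re

# Illustrative stop-word list; real pipelines use much larger,
# language-specific lists.
STOP_WORDS = {"the", "a", "an", "is", "of", "for", "and", "to"}

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = remove_stop_words(tokenize("The design of a chatbot for campus services."))
print(tokens)  # ['design', 'chatbot', 'campus', 'services']
```

Lemmatization and part-of-speech tagging need linguistic resources (e.g. a dictionary of word forms) and are covered in the required readings below.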
Check the GenAI usage policy if you are using GenAI with the course materials for self-study or fact-checking.
Preparation
Read the required course readings.
Lecture
Below are the slides:
Required Course Readings
- Section 4.4.2.1 (Tokenisation), 4.4.2.2 (Stop-word Removal), 4.4.2.3 (Lemmatisation), 4.4.2.4 (Stemming), 4.4.3.1 (Part-of-speech Tagging) in the ML4Design lecture notes (Bozzon, 2023)
- Section 5.1 (Lexical Semantics), 5.2 (Vector Semantics: The Intuition), 5.3 (Simple count-based embeddings), 5.4 (Cosine for measuring similarity), 5.5 (Word2vec), 11.1.1 (Representing documents as vectors), and 11.1.2 (Term weighting: tf-idf and BM25) in the book Speech and Language Processing (Jurafsky & Martin, 2026)
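Two of the ideas in these readings, tf-idf weighting (Section 11.1.2) and cosine similarity (Section 5.4), fit in a short sketch. The tiny corpus below is invented for illustration, and the idf formula is the simple log(N/df) variant; the book discusses smoothed alternatives.

```python
import math
from collections import Counter

# Toy corpus of three whitespace-tokenized documents.
docs = [
    "users like the new interface",
    "users dislike the old interface",
    "the manual describes the interface",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def idf(term: str) -> float:
    """Inverse document frequency: log(N / df)."""
    df = sum(term in doc for doc in tokenized)
    return math.log(N / df)

def tfidf_vector(doc: list[str]) -> dict[str, float]:
    """Map each term in the document to its tf-idf weight."""
    tf = Counter(doc)
    return {t: tf[t] * idf(t) for t in tf}

def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

v0, v1, v2 = (tfidf_vector(d) for d in tokenized)
# Terms appearing in every document ("the", "interface") get idf 0,
# so only the distinctive terms contribute to the similarity.
print(cosine(v0, v1) > cosine(v0, v2))  # True
```

Note how tf-idf makes the first two documents more similar to each other than to the third: they share the rarer term "users", while the words all three share carry zero weight.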
Optional Course Readings
- Section 5.5 (Maximum Likelihood Estimation) in the book Deep Learning (Goodfellow et al., 2016)
- Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM.
- Section 8.1 (Attention) in the book Speech and Language Processing (Jurafsky & Martin, 2026)
- Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
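The attention reading above (Section 8.1 of Jurafsky & Martin) can be previewed with a minimal sketch of scaled dot-product attention for a single query: score the query against every key, turn the scores into weights with a softmax, and average the values by those weights. The 2-dimensional vectors here are toy data chosen for illustration.

```python
import math

def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query: list[float],
              keys: list[list[float]],
              values: list[list[float]]) -> list[float]:
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    # Score the query against each key, scaled by sqrt(dimension).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Output = weighted average of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
out = attention([1.0, 0.0], keys, values)
# The query matches the first key, so the output leans toward the
# first value vector.
print(out)
```

In a transformer, the queries, keys, and values are learned linear projections of the input embeddings; this sketch only shows the weighting step itself.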
Additional Resources
The following videos explain some of the math concepts used in the lecture.
A video that explains the attention mechanism: