Lecture 7: Text Data Processing (Part I)
(Last updated: Feb 18, 2026)
This lecture introduces the theory of text data processing, including preprocessing (tokenization, lemmatization, and part-of-speech tagging), word embeddings, topic modeling, sequence-to-sequence modeling, and the attention mechanism.
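As a taste of the preprocessing steps listed above, here is a minimal sketch of tokenization and stop-word removal in plain Python. The tokenizer pattern and the stop-word list are illustrative choices made for this example, not the ones used in the course materials.

```python
import re

# Illustrative stop-word list; real pipelines use much larger,
# language-specific lists.
STOP_WORDS = {"the", "a", "an", "is", "of", "for", "and", "to"}

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = remove_stop_words(tokenize("The design of a chatbot for campus services."))
print(tokens)  # ['design', 'chatbot', 'campus', 'services']
```

Lemmatization and part-of-speech tagging need linguistic resources (e.g. a dictionary of word forms) and are covered in the required readings below.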
Check the GenAI usage policy if you are using GenAI with the course materials for self-study or fact-checking.
Preparation
Read the required course readings.
Lecture
Below are the slides:
Required Course Readings
- Section 4.4.2.1 (Tokenisation), 4.4.2.2 (Stop-word Removal), 4.4.2.3 (Lemmatisation), 4.4.2.4 (Stemming), 4.4.3.1 (Part-of-speech Tagging) in the ML4Design lecture notes (Bozzon, 2023)
- Section 5.1 (Lexical Semantics), 5.2 (Vector Semantics: The Intuition), 5.3 (Simple count-based embeddings), 5.4 (Cosine for measuring similarity), 5.5 (Word2vec), 11.1.1 (Representing documents as vectors), and 11.1.2 (Term weighting: tf-idf and BM25) in the book Speech and Language Processing (Jurafsky & Martin, 2026)
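Two of the ideas in these readings, tf-idf weighting (Section 11.1.2) and cosine similarity (Section 5.4), fit in a short sketch. The tiny corpus below is invented for illustration, and the idf formula is the simple log(N/df) variant; the book discusses smoothed alternatives.

```python
import math
from collections import Counter

# Toy corpus of three whitespace-tokenized documents.
docs = [
    "users like the new interface",
    "users dislike the old interface",
    "the manual describes the interface",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def idf(term: str) -> float:
    """Inverse document frequency: log(N / df)."""
    df = sum(term in doc for doc in tokenized)
    return math.log(N / df)

def tfidf_vector(doc: list[str]) -> dict[str, float]:
    """Map each term in the document to its tf-idf weight."""
    tf = Counter(doc)
    return {t: tf[t] * idf(t) for t in tf}

def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

v0, v1, v2 = (tfidf_vector(d) for d in tokenized)
# Terms appearing in every document ("the", "interface") get idf 0,
# so only the distinctive terms contribute to the similarity.
print(cosine(v0, v1) > cosine(v0, v2))  # True
```

Note how tf-idf makes the first two documents more similar to each other than to the third: they share the rarer term "users", while the words all three share carry zero weight.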
Optional Course Readings
- Section 5.5 (Maximum Likelihood Estimation) in the book Deep Learning (Goodfellow et al., 2016)
- Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM.
- Section 8.1 (Attention) in the book Speech and Language Processing (Jurafsky & Martin, 2026)
- Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
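The attention reading above (Section 8.1 of Jurafsky & Martin) can be previewed with a minimal sketch of scaled dot-product attention for a single query: score the query against every key, turn the scores into weights with a softmax, and average the values by those weights. The 2-dimensional vectors here are toy data chosen for illustration.

```python
import math

def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(query: list[float],
              keys: list[list[float]],
              values: list[list[float]]) -> list[float]:
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    # Score the query against each key, scaled by sqrt(dimension).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Output = weighted average of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
out = attention([1.0, 0.0], keys, values)
# The query matches the first key, so the output leans toward the
# first value vector.
print(out)
```

In a transformer, the queries, keys, and values are learned linear projections of the input embeddings; this sketch only shows the weighting step itself.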
Additional Resources
The following videos explain some of the math concepts used in the lecture.
A video that explains the attention mechanism: