Topics in Natural Language Processing (202-2-5381) Fall 2021

Meets: Online (Zoom) - Sundays 10-12

News:

18 Oct 20: Welcome to NLP 21 - Lecture 1
25 Oct 20: Language Modeling - Lecture 2, Quiz #01
01 Nov 20: Intro to Deep Learning for NLP - Lecture 3, Quiz #02
08 Nov 20: Word Embeddings (1/2) - Lecture 4, Quizz #03
09 Nov 20: HW1 published - Due date 30 Nov 2020 - start from the template notebook (html).
15 Nov 20: Word Embeddings (2/2) - Lecture 4, Quizz #04
22 Nov 20: Basic Statistical Concepts for ML in NLP, Quizz 05
29 Nov 20: Sequence Classification, Quizz 06
29 Nov 20: Sparse Text Generation, by Pedro Henrique Martins, Zita Marinho, André F. T. Martins provides a good analysis of how perplexity and sampling from language models impacts on computation of perplexity and quality of text generation. It shows 3 variants of perplexity to deal with unknown words.
06 Dec 20: More on Sequence Classification (CRF), Quizz 07
06 Dec 20: Registration for HW1 grading - email me to reserve your preferred slot.
15 Dec 20: HW2 published - Due date 04 Jan 2021 - start from the template notebook (html).
20 Dec 20: More on Sequence Classification, Syntax (1/3). Quizz 08
27 Dec 20: Syntax: Constituent Parsing (2/3). Quizz 09
03 Jan 21: Syntax: Latent Probabilistic Constituent Parsing / Dependency Parsing (3/3). Quizz 10
03 Jan 21: Registration for HW2 grading - email me to reserve your preferred slot.
04 Jan 21: HW3 - Due date 25 Jan 2021 - start from the template notebook (html).
10 Jan 21: Semantic Parsing. Quizz 11
15 Jan 21: END OF SEMESTER
28 Jan 21: Exam 13:30-15:30

The course will be given online - links to Zoom sessions will be available in this page weekly. Each week there will be a short quizz on the material from the previous week (in the form of an online form). You are expected to read the material of the week marked with (*) before the lecture on Sunday morning.

General Intro to NLP - Linguistic Concepts
Language Modeling
Intro to Deep Learning for Neural Language Processing
Word Embeddings
Basic Statistical Concepts for ML in NLP
Sequence Classification (1/2) HMM
Sequence Classification (2/2): Log-linear models, MEMM and CRF
Syntactic Analysis (1/3): CFGs
Syntactic Analysis (2/3): PCFGs
Syntactic Analysis (3/3): Dependency Parsing
Semantic Parsing

Objectives

The course is an introduction to Natural Language Processing. The main objective of the course is to learn how to develop practical computer systems capable of performing intelligent tasks on natural language: analyze, understand and generate written text. This task requires learning material from several fields: linguistics, machine learning and statistical analysis, and core natural language techniques and algorithms.

Understand what makes natural language complex to process automatically: variability (the possibility to express the same meaning in many different ways) and ambiguity (the fact that a single expression can refer to many different meanings in different contexts); levels of linguistic structure (word, sentence, text; morphology, syntax, semantics, pragmatics) and their interaction.
Understand machine learning techniques as applied to text: supervised, semi-supervised and unsupervised methods; training vs. testing; classification; regression; basic distributions, KL-divergence; Bayesian methods; Perceptron; Deep Learning methods in NLP; RNNs, LSTMs, Sequence to Sequence models and attention and transformers.
Learn Natural Language Processing techniques: word and sentence tokenization; parts of speech tagging; lemmatization and morphological analysis; chunking; named entity recognition; language models; probabilistic context free grammars; probabilistic dependency grammars; text simplification; paraphrase detection; summarization; text generation; topic modelling and semantic parsing.

Topics covered in class include:

Descriptive linguistic models
Language Models -- Statistical Models of Unseen Data (n-gram, smoothing, recurrent neural networks language models)
Language Models and deep learning -- word embeddings, continuous representations, neural networks, sequence to sequence models, attention, transformers
Parts of speech tagging, morphology, non categorical phenomena in tagging
Information Extraction / Named Entity Recognition
Using Machine Learning Tools: Classification, Sequence Labeling / Supervised Methods / SVM. CRF, Perceptron, Logistic Regression
Bayesian Statistics, generative models, topic models, LDA
Syntactic descriptions: Parsing sentence, why, how, PCFGs, Dependency Parsing
Semantic Parsing
Text Summarization

Lecture Notes

18 Oct 20: General Intro to NLP - Linguistic Concepts ZOOM Recorded session
Things to do:
1. Find a way to estimate how many words exist in English. Repeat in Hebrew. What method did you use? What definition of word did you use? (Think derivation vs. inflection).
2. Experiment with Google Translate: find ways to make Google Translate "fail dramatically" (generate very wrong translations). Explain your method and collect your observations. Document attempts you made that did NOT make Google Translate fail. (Think variability and ambiguity; Think syntactic complexity; think lexical sparsity, unknown words, divergence in morphology across languages).
3. Think of reasons why natural languages have evolved to become ambiguous (Think: what is the communicative function of language; who pays the cost for linguistic complexity and who benefits from it; is ambiguity created willingly or unconsciously?)

25 Oct 20: Language Modeling ZOOM Recorded session
1. * Complete reading material from Intro on sections: Ambiguity, Variability, Vagueness, Discrete and Sparse Data, Levels of Linguistic Description, Words: Tokens, Parts of Speech and Morphology, Sentences and Syntax
2. * N-gram Language Modeling Chapter 3 from Speech and Language Processing (SPL3), Jurafsky and Martin, 3rd Ed, 2016. (Focus on 3.1-3.4). If you prefer - the same material covered in slides format.
3. * Peter Norvig: How to Write a Spelling Corrector (2007).
  This is a toy spelling corrector illustrating the statistical NLP method (probability theory, dealing with large collections of text, learning language models, evaluation methods). Read an extended version of the material with more applications (word segmentation, n-grams, smoothing, more on bag of words, secret code decipher): How to Do Things with Words (Use this local copy adapted to Python 3 / html version and the support files: big.txt, count_1w.txt, count_2w.txt).
4. Spelling Correction and the Noisy Channel, Appendix B from SPL3, B.1-B.2. This covers similar material as Norvig's piece above in a more formal manner.
5. Things to do:
  1. * Read about Probability axioms and in more details in the notes from Fredrik Engstrom:
  2. Read about Edit Distance and in more details, a review of minimum edit distance algorithms using dynamic programming from Dan Jurafsky.
  3. * Install Python: I recommend installing the Anaconda distribution (choose the Python 3.8 version).
    The Anaconda distribution includes a large set of Python packages ready to use that we will find useful. (~500MB download, 3GB disk space needed.) In particular, Anaconda includes the nltk, pandas, numpy, scipy and scikit-learn packages.
    Alternatively, for people who prefer to have more control, install Miniconda (about 50MB download). Then launch the "miniconda shell" and execute the following command:
```
		% conda create -n nlp21 jupyter jupyterlab nltk matplotlib scikit-learn scipy 
		% conda activate nlp21
	
```
  4. Execute Norvig's spell checker for English (you will neeed the Python code from the article and the large file of text used for training big.txt).
  5. How many tokens are there in big.txt? How many distinct tokens? What are the 10 most frequent words in big.txt?
  6. In a very large corpus (discussed in the ngram piece quoted below), the following data is reported:
    The 10 most common types cover almost 1/3 of the tokens, the top 1,000 cover just over 2/3.
    What do you observe on the much smaller big.txt corpus?
  7. Read more from Norvig's piece on ngrams.
  8. Execute the word segmentation example from Norvig's ngram chapter (code in this notebook).
    Note the very useful definition of the @memo decorator in this example, which is an excellent method to implement dynamic programming algorithms in Python. From Python Syntax and Semantics:
    A Python decorator is any callable Python object that is used to modify a function, method or class definition. A decorator is passed the original object being defined and returns a modified object, which is then bound to the name in the definition. Python decorators were inspired in part by Java annotations, and have a similar syntax; the decorator syntax is pure syntactic sugar, using @ as the keyword:
```
	@viking_chorus
	def menu_item():
		print("spam")
		
```
    is equivalent to:
```
	def menu_item():
		print("spam")
	menu_item = viking_chorus(menu_item)
		
```
  9. This corpus includes a list of about 40,000 pairs of words (error, correction). It is too small to train a direct spell checker that would map word to word. Propose a way to learn a useful error model (better than the one used in Norvig's code) using this corpus. Hint: look at the model of weighted edit distance presented in Jurafsky's lecture cited above.

01 Nov 2020: Intro to Deep Learning Intro for Natural Language Processing Recorded ZOOM
Things to do:
1. Prove that a linear function from ℜn to ℜ cannot be bounded below. (See this post for reference)
2. * Learn Python if you don't know it (About 4 hours)
  - Python Tutorial (Use Python 3.9)
3. Install a good Python environment (About 2 hours) The default environment is pyCharm (the free Community Edition is good for our needs 127MB download). If you already have experienced with it, VSCode also has very good support for Python.
4. * Follow the NumPy quick start tutorial to learn how to manipulate arrays and tensors efficiently in Python. (2 hours) Check your knowledge with numpy exercises (1 hour).
5. * Install Pytorch in your environment (1 hour):
```
Install PyTorch: follow instructions from pytorch.org and follow the instructions for your OS (Linux, MacOS, Windows).
If you have an nVidia GPU on your machine, select the appropriate CUDA version that you have installed.
		
```
6. * Explore the PyTorch tutorials for text - starting from Deep Learning NLP Tutorial - cover the first 3 segments (3 hours).

08 Nov 20 - 15 Nov 20 Word Embeddings Recorded ZOOM 1/2 Recorded ZOOM 2/2
1. * View Chris Manning's Lecture on Word2Vec Winter 2019 (1h20) (see also cs224n site).
2. * Read corresponding slides
3. Execute the following PyTorch tutorials on dense word representations in deep learning:
  - * The Logistic regression bag of words model.
  - * Continuous bag of words model.
Further Reading::
1. Read Vector Semantics, Chapter 6 of Speech and Language Processing 3rd Ed, Jurafsky and Martin 2018.
2. Read Efficient Estimation of Word Representations in Vector Space from Mikolov et al (2013) and the Word2vec site.
3. Read Deep contextualized word representations, Peters et al NAACL 2018 (ELMO)
4. Read BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al NAACL 2019 (BERT).
5. Read Neural Word Embedding as Implicit Matrix Factorization by Levy and Goldberg, 2014.
6. Read king - man + woman is queen; but why? by Piotr Migdal, Jan 2017. Good summary of word embeddings with interactive visualization tools, including word2viz word analogies explorer.
7. FastText provides pre-trained word embeddings in many languages.
Things to do:
1. * Install Gensim in your environment (run "pip install gensim") and run the Gensim Word2vec tutorial.
2. Experiment with Sense2vec with spaCy and Gensim, Source code a tool to compute word embeddings taking into account multi-word expressions and POS tags.

22 Nov 2020: Basic Statistical Concepts for Supervised Machine Learning in NLP Recorded ZOOM
Things to do:
1. * Bayesian concept learning from Tenenbaum 1999 - reported in Murphy 2012 Chapter 3. (I also warmly recommend this slightly related ICML 2018 lecture of Josh Tenenbaum - 1h11 Building Machines that Learn & Think Like People).
2. * Read Deep Learning by Goodfellow, Bengio and Courville, 2016 Chapters 3 and 5. (About 4 hours)
3. * Watch the 15mn video (ML 7.1) Bayesian inference - A simple example by Mathematical Monk.
4. Read Statistical Data Analysis and specifically Introduction to Bayesian methods from Cyril Rossant, and execute the associated Jupyter notebooks. (About 1 hour)
5. Learn how to use Scipy and Numpy - Chapter 1 in Scipy Lectures (Focus on 1.4 and 1.6 - about 5 hours)
6. Write Python code using numpy, scipy and matplotlib.pyplot to draw the graphs of the Beta distribution that appear in the lecture notes (About 1 hour)
7. Given a dataset for a Bernouilli distribution (that is, a list of N bits), generate a sequence of N graphs illustrating the sequential update process, starting from a uniform prior until the Nth posterior distribution. Each graph indicates the distribution over μ, the parameter of the Bernouilli distribution (which takes value in the [0..1] range). (About 2 hours)
8. Learn how to draw Dirichlet samples using numpy.random.dirichlet. A sample from a Dirichlet distribution is a multinomial distribution. Understand the example from the Wikipedia article on Dirichlet distributions about string cutting:
```
   import numpy as np
   import matplotlib.pyplot as plt
   s = np.random.dirichlet((10, 5, 3), 20).transpose()
   plt.barh(range(20), s[0])
   plt.barh(range(20), s[1], left=s[0], color='g')
   plt.barh(range(20), s[2], left=s[0]+s[1], color='r')
   plt.title("Lengths of Strings")
   plt.show()
   
```
  (About 2 hours)
9. Compute the MLE estimator μ_MLE of a binomial distribution Bin(m|N, μ).
10. Mixture Priors: assume we contemplate two possible modes for the value of our Beta-Binomial model parameter μ. A flexible method to encode this belief is to consider that our prior over the value of μ has the form:
```
   μ ~ k₁Beta(a, b) + k₂Beta(c, d)
   where k₁ + k₂ = 1
   m ~ Bin(μ N)
   
```
  A prior over μ of this form is called a mixture prior - as it is a linear combination of simple priors.
  1. Prove that the mixture prior is a proper probabilistic distribution.
  2. Compute the posterior density over μ for a dataset where (N = 10, m=8, N-m=2) where k₁=0.8 and k₂=0.2 and the prior distributions are Beta(1,10) and Beta(10,1). Write Python code to draw the prior density of μ and its posterior density. (About 2 hours)
  3. Experiment with a very simple form of Stochastic Gradient Descent (SGD) with a custom loss function by running this notebook. More examples available on the Autograd project homepage.
29 Nov 2020 Sequence Classification 1/2 Recorded ZOOM
1. * Read the Parts of Speech Tagging notebook (code). (3 hours).
2. Read about the Universal Dependencies definition.
3. * Read Chapter 6: Learning to Classify Text of the NLTK Book (About 2 hours).
4. * Read Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression, Tom Mitchell, 2015. (About 3 hours)
5. * Read Michael Collins's notes on Tagging Problems, and Hidden Markov Models: POS tagging and Named Entity Recognition as tagging problems (with BIO tag encoding), generative and noisy channel models, generative tagging models, trigram HMM, conditional independence assumptions in HMMs, estimating the parameters of an HMM, decoding HMMs with the Viterbi algorithm. (about 2 hours)
6. Watch (ML 8.1) Naive Bayes Classification a 15mn video on Naive Bayes Classification by Mathematical Monk and the following chapter (ML 8.3) about Bayesian Naive Bayes (20 minutes).
Practical work:
1. Explore the documentation of the nltk.classify module.
2. Read the code of the NLTK Naive Bayes classifier and run nltk.classify.naivebayes.demo()
3. Read the code of the NLTK classifier demos: names_demo and wsd_demo.
4. Read the documentation on feature extraction in Scikit-learn.
5. Run the example on document classification in Scikit-learn and the 20 newsgroup dataset using Logistic Regression.
6. Experiment with the example of classifications in this iPython notebook (code) which shows how to run NLTK classifiers in a variety of ways.
7. Experiment with the Reuters Dataset notebook (code) illustrating document classification with bag of words features and TF-IDF transformation.

06 Dec 20 - Sequence Classification 2/2 Recorded ZOOM
1. * Read Michael Collins's notes on Tagging Problems, and Hidden Markov Models
Practical Work:
1. Explain why the problem of decoding (see 2.5.4 in Tagging Problems, and Hidden Markov Models) requires a dynamic programming algorithm (Viterbi) while we did not need such a decoding step when we discussed classification using Logistic Regression and Naive Bayes?
2. Implement Algorithm 2.5 (Viterbi with backpointers) from Tagging Problems, and Hidden Markov Models in Python. Test it on the Brown POS tagging dataset using MLE for tag transitions estimation (parameters q) and a discounting language model for each tag in the Universal taget for parameters e(x|tag) for each tag.
3. Run Sequence Models and LSTM in PyTorch.

20 Dec 20 - Sequence Classification / Syntax (1/3) Recorded ZOOM
1. * Read Log-Linear Models, MEMMs, and CRFs by Michael Collins.
2. * Read Sequence Models and LSTM - PyTorch Tutorial.
3. * Context Free Grammars Parsing

27 Dec 20 - Syntax: Constituent Parsing (2/3) Recorded ZOOM

03 Jan 21 - Syntax: Latent Probabilistic Constituent Parsing / Dependency Parsing (3/3) Recorded ZOOM
1. * Probabilistic Context Free Grammars Parsing
2. * Michael Collins's lecture on Lexicalized PCFGs:
  1. Why CFGs are not adequate for describing treebanks: lack of sensitivity to lexical items + lack of sensitivity to structural preferences.
  2. How to lexicalize CFGs with Head propagation.
  3. How to parse a lexicalized PCFG.
3. * Dependency Parsing by Graham Neubig. Graham's teaching page with github page for exercises.
4. * Dependency Parsing: Past, Present, and Future, Chen, Li and Zhang, 2014 (Coling 2014 tutorial)
5. * NLTK Dependency Parsing Howto
6. Parsing English with 500 lines of Python, an implementation by Matthew Honnibal of Training Deterministic Parsers with Non-Deterministic Oracles, Yoav Goldberg and Joakim Nivre, TACL 2013. (Complete Python code)
7. Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations, Eliyahu Kiperwasser and Yoav Goldberg, 2016, neural dependency parser.

10 Jan 21 - Semantic analysis, Recorded ZOOM

Software

We will use the following libraries during the semester:

NLTK: Nltk is a Python based toolkit with wide coverage of NLP techniques - both statistical and knowledge-based. It includes facilities to manipulate datasets.
Spacy: high-level, very fast Python library covering tokenization, named entity recognition, POS tagging, dependency parsing with high accuracy in many languages.
PyTorch - a deep learning framework in Python.
AllenNLP - an NLP research library built on top of PyTorch (with pre-trained models for common tasks - document encoding, tagging, parsing, semantic parsing).
huggingface Transformers - built on top of PyTorch - provides uniform access to Transformer-based models for text generation and understanding with many pretrained models providing high performance. Very good documentation and tutorials. Covers BERT, Roberta, T5, LXMERT, RAG, GPT2 and many more.
huggingface Tokenizers - a very fast trainable flexible text tokenization library.
huggingface Datasets - uptodate library to access NLP datasets commonly used in research - includes methods to load datasets efficiently, index datasets for fast search, and compute standard metrics.
Scikit-learn - a Python library for Machine Learning. Presents a uniform interface for many ML tasks (fit, transform). Good text processing example ( Working with text documents).

Resources

Notebooks Jupyter Notebooks are an effective medium to present computational methods. They combine text, mathematical formula, graphics and executable code. It is often possible to run notebooks in Google Colab.
- Practical Data Science in Python by Radim Rehurek. This is an iPython notebook demonstrating how to write classifiers in Python (using ScikitLearn). The concrete example is a spam detector on SMS messages.
- Out-of-core classification of text documents, scikit-learn example showing how to perform document classification on the Reuters-21578 database.
- SciPy Tutorials Comprehensive tutorial to Machine Learning using SciPy and ScikitLearn.
- Gensim tutorial, high-level tour of the Gensim features: streaming documents in an efficient manner, create and use topic models (LDA, LSI), use word vectors (word2vec), document vectors (doc2vec), tf*idf encoding of corpus, compute documents similarity and document summarization.
- Guide to NLP with AllenNLP.
- Spacy Linguistic Features - excellent review of linguistic annotations computed by Spacy (tokenization, subword units, POS tags, dependency trees, named entities and sentence segmentation).
- Online Courses
  - Noah Smith Natural Language Processing Course, University of Washington, Spring 2017, video lectures
  - Natural Language Processing, by Jason Eisner, John Hopkins University, Fall 2018
  - CIS 530 - Computational Linguistics, Chris Callison-Burch, University of Pennsylvania, Spring 2019
  - Applied Machine Learning, by Andreas Mueller, Columbia University, Spring 2018
  - CS224N: Natural Language Processing with Deep Learning, Chris Manning, Stanford University, Winter 2019
  - Neural Networks for NLP, by Graham Neubig, CMU, Spring 2019. This course has exceptional coverage of recent research with comprehensive list of references on each of the topics; it also presents hands-on code in Python for each of the topics.
  - NLP course, Dragomir Radev, Coursera, 2016 (80 video lectures of about 15 min each)
  - Natural Language Processing, Dan Jurafsky and Chris Manning, 2012, Stanford University (100 video lectures of about 15 min each)
  - Statistical Methods for NLP, by Joakim Nivre, Uppsala, 2012
- Books:
  - Jacob Eisenstein, Introduction to Natural Language Processing, online PDF draft, or MIT Press version (2019).
  - Speech and Language Processing (3rd ed. draft), by Dan Jurafsky and James H. Martin
  - Introduction to Natural Language Processing, by Steven Bird, Ewan Klein and Edward Loper, 2009, distributed on the NLTK site.
  - Neural Network Methods for Natural Language Processing, Yoav Goldberg, April 2017
  - Dive into Deep Learning, A. Zhang, Z. Lipton, and A. Smola, 2019. This is an interactive textbook with code based on NumPy covering deep learning in general, with good chapters on DL for NLP. Appendix on Math for DL is excellent.
  - Pattern Recognition and Machine Learning, Chris Bishop, 2007. Free download
  - Information Theory, Inference, and Learning Algorithms, David J.C. MacKay, 2003.
- Python
  1. Introduction to Python Generators
  2. Scikit Learn: Machine Learning in Python
Last modified 01 Jan 2021 Michael Elhadad

Topics in Natural Language Processing (202-2-5381) Fall 2021

News:

Contents

Objectives

Software