Michael Elhadad

Natural Language Processing (202-2-5211) Fall 2016

Meets:
Sun 12-14 Bdg 35 Room 211
Thu 10-12 Bdg 34 Room 114

News:

  1. 25 Feb 16: Moed B of the exam will take place on Feb 25 13:30 in Room 34:107 (for students who missed Moed A)
  2. 10 Jan 16: The exam will take place on Feb 07 17:30 in Room 34:10.
  3. 28 Dec 15: HW3 is available!
  4. 27 Dec 15: Registration for HW2 Grading is open: send me email with the slot you would like to reserve.
  5. 14 Dec 15: HW2 is available
  6. 01 Dec 15: Registration for HW1 Grading is open: send me email with the slot you would like to reserve.
  7. 26 Nov 15: HW1 is available
  8. 25 Oct 15: Welcome to NLP 16!

Contents

  1. General Intro to NLP - Linguistic Concepts
  2. How to Write a Spelling Corrector by Peter Norvig - learning from data, noisy channel model
  3. Parts of speech Tagging
  4. Basic Statistic Concepts
  5. Classification
  6. Sequence Classification
  7. Word Embeddings
  8. Syntax and Parsing
  9. Summarization
  10. Topic Modeling
  11. Deep Learning for NLP

Objectives

The course is an introduction to Natural Language Processing. The main objective of the course is to learn how to develop practical computer systems capable of performing intelligent tasks on natural language: analyze, understand and generate written text. This task requires learning material from several fields: linguistics, machine learning and statistical analysis, and core natural language techniques.
  1. Acquire basic understanding of linguistic concepts and natural language complexity: variability (the possibility to express the same meaning in many different ways) and ambiguity (the fact that a single expression can refer to many different meanings in different contexts); levels of linguistic description (word, sentence, text; morphology, syntax, semantics, pragmatics). Schools of linguistic analysis (functional, distributional, Chomskyan); Empirical methods in Linguistics; Lexical semantics; Syntactic description; Natural language semantics issues.
  2. Acquire basic understanding of machine learning techniques as applied to language: supervised vs. unsupervised methods; training vs. testing; classification; regression; distributions, KL-divergence; Bayesian methods; Support Vector Machines; Perceptron; Deep Learning methods in NLP.
  3. Natural language processing techniques: word and sentence tokenization; parts of speech tagging; lemmatization and morphological analysis; chunking; named entity recognition; n-gram language models; probabilistic context free grammars; probabilistic dependency grammars; parsing accuracy metrics; Treebank analysis; Text simplification; Paraphrase detection; Summarization; Text generation.
Topics covered in class include:
  1. Descriptive linguistic models
  2. Language Models and n-grams -- Statistical Models of Unseen Data (Smoothing)
  3. Language Models and deep learning -- word embeddings, continuous representations, neural networks
  4. Parts of speech tagging, morphology, non-categorical phenomena in tagging
  5. Information Extraction / Named Entity Recognition
  6. Using Machine Learning Tools: Classification, Sequence Labeling / Supervised Methods / SVM, CRF, Perceptron
  7. Bayesian Statistics, generative models, topic models, LDA
  8. Syntactic descriptions: Parsing sentences (why and how), PCFGs, Dependency Parsing
  9. Compositional Semantics from CFG Parsing
  10. Text Summarization
  11. Text Generation


Lecture Notes
  1. 25 Oct 15: General Intro to NLP - Linguistic Concepts

    Things to do:

    1. Find a way to estimate how many words exist in English. In Hebrew. What method did you use? What definition of word did you use? (Think derivation vs. inflection)
    2. Experiment with Google Translate: find ways to make Google Translate "fail dramatically" (generate very wrong translations). Explain your method and collect your observations. Document attempts you made that did NOT make Google Translate fail. (Think variability and ambiguity.)
    3. Think of reasons why natural languages have evolved to become ambiguous.


  2. 01 Nov 15: Peter Norvig: How to Write a Spelling Corrector (2007).

    This is a toy spelling corrector illustrating the statistical NLP method (probability theory, dealing with large collections of text, learning language models, evaluation methods).

    Things to do:

    1. Read about Probability axioms.
    2. Read about Edit Distance and, in more detail, a review of minimum edit distance algorithms using dynamic programming from Dan Jurafsky.
    3. Install Python: I recommend installing the Anaconda distribution (choose the Python 3.5 version). (Note: Many of the code samples you will see are written in Python 2, which is not fully compatible with Python 3. The most visible difference is that in Python 2 you can write print x, whereas in Python 3 it must be print(x).)

      The Anaconda distribution includes a large set of Python packages ready to use that we will find useful. (367MB download, 2GB disk space needed.) In particular, Anaconda includes the nltk, pandas, numpy, scipy and scikit-learn packages.

    4. Execute Norvig's spell checker for English (you will need spell.py (this version is adapted to Python 3: spell.py) and big.txt, the large file of text used for training).
    5. How many tokens are there in big.txt? How many distinct tokens? What are the 10 most frequent words in big.txt?
    6. In a very large corpus (discussed in the ngram piece quoted below), the following data is reported:
      The 10 most common types cover almost 1/3 of the tokens, the top 1,000 cover just over 2/3.
      What do you observe on the much smaller big.txt corpus?
    7. You can read more from Norvig's piece on ngrams.
    8. Execute the word segmentation example from Norvig's ngram chapter (code in ngrams.py).

      Note the very useful definition of the @memo decorator in this example, which is an excellent method to implement dynamic programming algorithms in Python. From Python Syntax and Semantics:

      A Python decorator is any callable Python object that is used to modify a function, method or class definition. A decorator is passed the original object being defined and returns a modified object, which is then bound to the name in the definition. Python decorators were inspired in part by Java annotations, and have a similar syntax; the decorator syntax is pure syntactic sugar, using @ as the keyword:
      @viking_chorus
      def menu_item():
          print("spam")
      	
      is equivalent to:
      def menu_item():
          print("spam")
      menu_item = viking_chorus(menu_item)
      	
    9. This corpus includes a list of about 40,000 pairs of words (error, correction). It is too small to train a direct spell checker that would map word to word. Propose a way to learn a useful error model (better than the one used in Norvig's code) using this corpus. Hint: look at the model of weighted edit distance presented in Jurafsky's lecture cited above.
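
The remark above about @memo and dynamic programming can be illustrated on the edit distance problem from item 2. This is a minimal sketch, not Norvig's code: the memo decorator below is a simplified re-implementation, edit_distance is a name chosen here for illustration, and it assumes unit costs for insertion, deletion and substitution (no transpositions, no weighted edits).

```python
import functools


def memo(f):
    """Cache the results of f so each distinct argument tuple is computed once."""
    cache = {}

    @functools.wraps(f)
    def wrapper(*args):
        if args not in cache:
            cache[args] = f(*args)
        return cache[args]

    return wrapper


@memo
def edit_distance(s, t):
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    if not s:
        return len(t)
    if not t:
        return len(s)
    # Compare the last characters and recurse on the shorter prefixes.
    substitution = edit_distance(s[:-1], t[:-1]) + (s[-1] != t[-1])
    deletion = edit_distance(s[:-1], t) + 1
    insertion = edit_distance(s, t[:-1]) + 1
    return min(substitution, deletion, insertion)


print(edit_distance("kitten", "sitting"))
```

Without the decorator the recursion recomputes the same prefix pairs exponentially often. With it, each pair of prefixes is solved once, which is the dynamic programming table in disguise. The standard library's functools.lru_cache plays the same role.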


  3. 05-12 Nov 2015: Parts of speech Tagging (3 lectures)

    Things to do:

    1. Learn Python:
    2. Install a good Python environment: The default environment is PyCharm (the free Community Edition is good for our needs; 127MB download).

      Hackers may want to get deeper in mastering Python's environment: Python's ecosystem.

    3. Install NLTK: if you have installed Anaconda, it is already installed. Make sure to download the corpora included with nltk.

      Q: How do you find out where your package is installed after you use easy_install?

      A: in the Python shell, type: import nltk; then type: nltk. You will get an answer like:

      >>> import nltk
      >>> nltk
      <module 'nltk' from 'C:\Anaconda\lib\site-packages\nltk\__init__.py'>
      >>>
         
    4. Explore the Brown corpus of parts-of-speech tagged English text using NLTK's corpus reader and FreqDist object:
      • What are the 10 most common words in the corpus?
      • What are the 5 most common tags in the corpus?
    5. Read Chapter 5 of the NLTK book
    6. Advanced topics in POS tagging: we will get back to the task of POS tagging with different methods in the following chapters, for more advanced sequence labeling methods (HMM), feature-based classifier methods for tagging (CRF), and as a test case for unsupervised EM techniques and Bayesian techniques. You can look at the source code of the nltk.tag module for a feeling of how the tag.hmm, tag.crf and tag.tnt methods are implemented.

      The following papers give a good feeling of the current state of the art in POS tagging:

    7. Read A good POS tagger in 200 lines of Python, an Averaged Perceptron implementation with good features, fast, reaches 97% accuracy (by Matthew Honnibal).
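
For the corpus exploration in item 4, the counting step can be sketched with the standard library alone. This is only an illustration under stated assumptions: most_common_words_and_tags is a hypothetical helper, the toy sample stands in for the real corpus, and words are lowercased before counting (a choice you may want to revisit). With NLTK installed, the same function should accept nltk.corpus.brown.tagged_words(), and NLTK's FreqDist offers a similar most_common method.

```python
from collections import Counter


def most_common_words_and_tags(tagged_words, n_words=10, n_tags=5):
    """Return the n_words most frequent words and the n_tags most frequent tags.

    tagged_words is any iterable of (word, tag) pairs.
    """
    word_counts = Counter()
    tag_counts = Counter()
    for word, tag in tagged_words:
        word_counts[word.lower()] += 1  # assumption: case-insensitive word counts
        tag_counts[tag] += 1
    return word_counts.most_common(n_words), tag_counts.most_common(n_tags)


# Hypothetical toy sample standing in for the Brown corpus.
sample = [("The", "AT"), ("dog", "NN"), ("saw", "VBD"),
          ("the", "AT"), ("cat", "NN"), (".", ".")]
top_words, top_tags = most_common_words_and_tags(sample, n_words=2, n_tags=2)
print(top_words)
print(top_tags)
```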


  4. 15-22 Nov 2015: Basic Statistic Concepts (3 lectures)

    Things to do:

    1. Watch the 15-minute video (ML 7.1) Bayesian inference - A simple example by Mathematical Monk.
    2. Write a Python generator that generates all 2^n subsets given a set of n elements.
    3. Make sure you have installed numpy and scipy in your Python environment. Easiest way is to use the Anaconda distribution. Else, download the setup for scipy in scipy at sourceforge -- choose the precompiled binaries called "superpacks".
    4. Write Python code using numpy, scipy and matplotlib.pyplot to draw the graphs of the Beta distribution that appear in the lecture notes.
    5. Given a dataset for a Bernoulli distribution (that is, a list of N bits), generate a sequence of N graphs illustrating the sequential update process, starting from a uniform prior until the Nth posterior distribution. Each graph indicates the distribution over μ, the parameter of the Bernoulli distribution (which takes value in the [0..1] range).
    6. Learn how to draw Dirichlet samples using numpy.random.mtrand.dirichlet. A sample from a Dirichlet distribution is a multinomial distribution. Understand the example from the Wikipedia article on Dirichlet distributions about string cutting:
         import numpy as np
         import matplotlib.pyplot as plt
         s = np.random.dirichlet((10, 5, 3), 20).transpose()
         plt.barh(range(20), s[0])
         plt.barh(range(20), s[1], left=s[0], color='g')
         plt.barh(range(20), s[2], left=s[0]+s[1], color='r')
         plt.title("Lengths of Strings")
         plt.show()
         
    7. Compute the MLE estimator μ_MLE of a binomial distribution Bin(m|N, μ).
    8. Mixture Priors: assume we contemplate two possible modes for the value of our Beta-Binomial model parameter μ. A flexible method to encode this belief is to consider that our prior over the value of μ has the form:
         μ ~ k1 Beta(a, b) + k2 Beta(c, d)
         where k1 + k2 = 1
         m ~ Bin(N, μ)
         
      A prior over μ of this form is called a mixture prior - as it is a linear combination of simple priors.
      1. Prove that the mixture prior is a proper probabilistic distribution.
      2. Compute the posterior density over μ for a dataset where (N = 10, m=8, N-m=2) where k1=0.8 and k2=0.2 and the prior distributions are Beta(1,10) and Beta(10,1). Write Python code to draw the prior density of μ and its posterior density.
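
The sequential update process of item 5 rests on the conjugacy of the Beta prior with the Bernoulli likelihood: observing a 1 turns Beta(a, b) into Beta(a+1, b), and observing a 0 turns it into Beta(a, b+1). The sketch below uses only the standard library and leaves the plotting out. The function names sequential_beta_updates and beta_pdf are chosen here for illustration, and the data list is a made-up example.

```python
from math import exp, lgamma, log


def sequential_beta_updates(bits, a=1.0, b=1.0):
    """Yield the Beta hyperparameters (a, b) before and after each observed bit.

    The default prior Beta(1, 1) is the uniform distribution on [0, 1].
    """
    yield a, b  # the prior, before any observation
    for bit in bits:
        if bit == 1:
            a += 1
        else:
            b += 1
        yield a, b


def beta_pdf(mu, a, b):
    """Density of Beta(a, b) at mu, for 0 < mu < 1, computed in log space."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1) * log(mu) + (b - 1) * log(1 - mu))


data = [1, 1, 0, 1]  # hypothetical dataset of N = 4 bits
for a, b in sequential_beta_updates(data):
    print("Beta(%g, %g): mean = %.3f, density at 0.5 = %.3f"
          % (a, b, a / (a + b), beta_pdf(0.5, a, b)))
```

To produce the graphs, one could evaluate beta_pdf (or scipy.stats.beta.pdf) on a grid of μ values for each yielded pair and pass the result to matplotlib.pyplot.plot.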


  5. 26 Nov-03 Dec 15 Classification (3 lectures)

    1. Read Chapter 6: Learning to Classify Text of the NLTK Book.
    2. Read Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression, Tom Mitchell, 2015.
    3. Watch (ML 8.1) Naive Bayes Classification, a 15-minute video on Naive Bayes Classification by Mathematical Monk, and the following chapter (ML 8.3) about Bayesian Naive Bayes.
    4. Read and execute the tutorial on Using Theano for Logistic Regression
    Practical work:
    1. Explore the documentation of the nltk.classify module.
    2. Read the code of the NLTK Naive Bayes classifier and run nltk.classify.naivebayes.demo()
    3. Read the code of the NLTK classifier demos: names_demo and wsd_demo.
    4. Read the documentation on feature extraction in Scikit-learn.
    5. Run the example on document classification in Scikit-learn.
    6. Experiment with the example of classifications in this iPython notebook (code) which shows how to run NLTK classifiers in a variety of ways.
    7. Experiment with the Reuters Dataset notebook (code) illustrating document classification with bag of words features and TF-IDF transformation.
    8. Experiment with a very simple form of Stochastic Gradient Descent (SGD) with a custom loss function by running this notebook. (Notebook source here). More examples available on the Autograd project homepage.
    9. The Theano tutorial on Logistic Regression is applied to a vision task (MNIST hand-written digit recognition). Apply the Theano classes to the task of Text Classification using the same dataset as Scikit-learn tutorial on text classification on the 20 newsgroup dataset.
    10. (Advanced) From Logistic regression to deep nets is a step by step notebook illustrating how to modify a Logistic Regression classifier into a deep net with good explanation of regularization, SGD, and back-propagation. It uses a simplified dataset of the MNIST digits dataset called the "small digits dataset".
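
Before reading the NLTK and Scikit-learn implementations, it may help to see how little is needed for a bag-of-words Naive Bayes classifier. The following is a rough sketch, not the NLTK code: train_naive_bayes and classify are hypothetical names, the four training documents are invented, smoothing is plain add-one, and words unseen in training are simply ignored.

```python
from collections import Counter, defaultdict
from math import log


def train_naive_bayes(labeled_docs):
    """Collect counts from an iterable of (list_of_tokens, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def classify(tokens, model):
    """Return the label maximizing log P(label) + sum of log P(token | label)."""
    label_counts, word_counts, vocab = model
    n_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        total = sum(word_counts[label].values())
        score = log(label_counts[label] / n_docs)
        for tok in tokens:
            if tok in vocab:  # assumption: skip words never seen in training
                # Add-one (Laplace) smoothing over the vocabulary.
                score += log((word_counts[label][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set.
train = [("good great fun".split(), "pos"), ("great acting".split(), "pos"),
         ("boring bad".split(), "neg"), ("bad plot boring".split(), "neg")]
model = train_naive_bayes(train)
print(classify("great fun".split(), model))
```

Note that classification here is a simple argmax over a handful of labels, each scored independently.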


  6. 03-06 Dec 15 Sequence Classification (2 lectures)

    1. Read Michael Collins's notes on Language Modeling: Markov models for fixed length sequences, for variable length sequences, trigram language models, MLE estimates, perplexity over n-gram models, smoothing of n-gram estimates with linear interpolation.
    2. Read Michael Collins's notes on Tagging Problems, and Hidden Markov Models: POS tagging and Named Entity Recognition as tagging problems (with BIO tag encoding), generative and noisy channel models, generative tagging models, trigram HMM, conditional independence assumptions in HMMs, estimating the parameters of an HMM, decoding HMMs with the Viterbi algorithm.

    Things to do:

    1. Implement the bigram and trigram language model described in Language Modeling with the discounting method described in 1.4.2 in Python. Test it on the nltk.corpus.gutenberg dataset split as training, development and test of 80%, 10% and 10%.

      Optimize the value of the β parameter of the method on the development set. Compare the perplexity of the bigram and trigram models on the test dataset (as defined in 1.3.3).

    2. Explain why the problem of decoding (see 2.5.4 in Tagging Problems, and Hidden Markov Models) requires a dynamic programming algorithm (Viterbi) while we did not need such a decoding step in the previous chapter when we discussed Logistic Regression and Naive Bayes?
    3. Implement Algorithm 2.5 (Viterbi with backpointers) from Tagging Problems, and Hidden Markov Models in Python. Test it on the Brown POS tagging dataset using MLE for tag transitions estimation (parameters q) and a discounting language model for each tag in the Universal tagset for parameters e(x|tag) for each tag.
    4. Do Assignment 3 from Richard Johansson's course on Machine Learning for NLP, 2014. Read the assignment 3 material and Lecture 6: predicting structured objects.

      Start from the excellent Python implementation of the structured perceptron algorithm.
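
As a warm-up for the Viterbi exercise, here is a deliberately simplified sketch: it decodes a bigram HMM, whereas Algorithm 2.5 in the notes is for a trigram HMM, so it is not a solution to item 3. The function name viterbi, the dictionaries log_q and log_e, and the toy parameters are all assumptions made for this example. Scores are kept in log space, and missing parameters count as probability zero.

```python
from math import log


def viterbi(words, tags, log_q, log_e, start="*", stop="STOP"):
    """Most likely tag sequence for words under a bigram HMM.

    log_q[(prev_tag, tag)] and log_e[(tag, word)] hold log-probabilities.
    """
    if not words:
        return []
    neg_inf = float("-inf")
    # pi[i][tag]: best log-score of a tag sequence for words[:i+1] ending in tag.
    pi = [{} for _ in words]
    back = [{} for _ in words]
    for i, word in enumerate(words):
        for tag in tags:
            emit = log_e.get((tag, word), neg_inf)
            if i == 0:
                pi[i][tag] = log_q.get((start, tag), neg_inf) + emit
                back[i][tag] = None
            else:
                prev = max(tags, key=lambda p: pi[i - 1][p]
                           + log_q.get((p, tag), neg_inf))
                pi[i][tag] = pi[i - 1][prev] + log_q.get((prev, tag), neg_inf) + emit
                back[i][tag] = prev
    # Pick the best final tag, including the transition to STOP.
    last = max(tags, key=lambda t: pi[-1][t] + log_q.get((t, stop), neg_inf))
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])  # follow backpointers right to left
    path.reverse()
    return path


# Hypothetical toy parameters.
q = {("*", "D"): 0.8, ("*", "N"): 0.2, ("D", "N"): 1.0, ("N", "V"): 0.6,
     ("N", "N"): 0.1, ("N", "STOP"): 0.3, ("V", "D"): 0.5, ("V", "N"): 0.2,
     ("V", "STOP"): 0.3}
e = {("D", "the"): 1.0, ("N", "dog"): 0.6, ("N", "barks"): 0.4, ("V", "barks"): 1.0}
log_q = {k: log(v) for k, v in q.items()}
log_e = {k: log(v) for k, v in e.items()}
print(viterbi(["the", "dog", "barks"], ["D", "N", "V"], log_q, log_e))
```

The table pi has one cell per (position, tag) pair and each cell looks at every previous tag, so decoding costs O(n |T|^2) instead of enumerating all |T|^n tag sequences.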


  7. 10-17 Dec 15 Word Embeddings (2 lectures)

    1. Read CS 224D: Deep Learning for NLP - Lecture Notes: Part I by Richard Socher, 2015 and the links of the first chapter in the Deep Learning for NLP course.
    2. Read Efficient Estimation of Word Representations in Vector Space from Mikolov et al (2013) and the Word2vec site.
    3. Read TensorFlow's tutorial on Word Embeddings. Note: If you want to run the code, TensorFlow runs only on Python 2.7 (not 3.x) and only on Linux or MacOS (not Windows).
    4. Read Word2Vec Explained by Yoav Goldberg and Omer Levy, 2014.
    5. Read Neural Word Embedding as Implicit Matrix Factorization by Levy and Goldberg, 2014.
    6. Read Improving Distributional Similarity with Lessons Learned from Word Embeddings by Levy, Goldberg and Dagan, 2015.
    7. Read Linguistic Regularities in Continuous Space Word Representations by Mikolov et al, 2013.

    Things to do:

    1. Install Gensim in your environment (run "conda install gensim") and run the Gensim Word2vec tutorial.
    2. Experiment with the Kaggle competition for using Google's word2vec package for sentiment analysis.
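
The analogy tests discussed in the readings on linguistic regularities reduce to vector arithmetic followed by a nearest-neighbor search under cosine similarity. The sketch below shows only that mechanism: the two-dimensional vectors are hand-made for the example (real embeddings are learned and have hundreds of dimensions), and cosine and analogy are hypothetical helper names, not Gensim or word2vec functions.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def analogy(a, b, c, vectors):
    """Return the word d such that a is to b as c is to d.

    d is the word whose vector is closest (by cosine) to b - a + c,
    excluding the three query words themselves.
    """
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Hand-made toy vectors: first dimension ~ "royalty", second ~ "gender".
toy_vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.1, 1.0),
    "woman": (0.1, -1.0),
    "apple": (-1.0, 0.2),
}
print(analogy("man", "king", "woman", toy_vectors))
```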


  8. 20 Dec-03 Jan 16 Syntax and Parsing (5 lectures)
    1. Context Free Grammars Parsing
    2. Probabilistic Context Free Grammars Parsing
    3. Michael Collins's lecture on CFGs and CKY
    4. Michael Collins's lecture on Lexicalized PCFGs:
      1. Why CFGs are not adequate for describing treebanks: lack of sensitivity to lexical items + lack of sensitivity to structural preferences.
      2. How to lexicalize CFGs with Head propagation.
      3. How to parse a lexicalized PCFG.
    5. NLTK tools for PCFG parsing
    6. Notes on computing KL-divergence
    7. Dependency Parsing:
      1. Dependency Parsing by Graham Neubig. Graham's teaching page with github page for exercises.
      2. Dependency Parsing: Past, Present, and Future, Chen, Li and Zhang, 2014 (Coling 2014 tutorial)
      3. NLTK Dependency Parsing Howto
      4. Parsing English with 500 lines of Python, an implementation by Matthew Honnibal of Training Deterministic Parsers with Non-Deterministic Oracles, Yoav Goldberg and Joakim Nivre, TACL 2013. (Complete Python code)
      5. Neural Network Dependency Parser, Chen and Manning 2014. A Java implementation of a Neural Network Dependency Parser with Unlabelled accuracy of 92% and Labelled accuracy of 90%.
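
To make the CKY algorithm from the CFG lectures listed above concrete, here is a minimal recognizer sketch. It assumes a grammar already in Chomsky Normal Form, it only answers whether a sentence is derivable (no probabilities and no tree recovery), and the grammar and the names cky_recognize, lexical and binary are invented for this example.

```python
def cky_recognize(words, lexical, binary, start="S"):
    """Return True if words can be derived from start by a CNF grammar.

    lexical maps a word to the set of non-terminals A with a rule A -> word.
    binary maps a pair (B, C) to the set of non-terminals A with a rule A -> B C.
    """
    n = len(words)
    if n == 0:
        return False
    # chart[i][j] holds the non-terminals that derive words[i:j].
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = set(lexical.get(word, ()))
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between the two children
                for left in chart[i][k]:
                    for right in chart[k][j]:
                        chart[i][j] |= binary.get((left, right), set())
    return start in chart[0][n]


# Hypothetical toy grammar: S -> NP VP, NP -> D N, VP -> V NP.
lexical = {"the": {"D"}, "dog": {"N"}, "cat": {"N"}, "saw": {"V"}}
binary = {("NP", "VP"): {"S"}, ("D", "N"): {"NP"}, ("V", "NP"): {"VP"}}
print(cky_recognize("the dog saw the cat".split(), lexical, binary))
```

Turning this into a PCFG parser amounts to storing, for each non-terminal in a cell, the best probability and a backpointer instead of mere membership.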

  9. 07-10 Jan 16 Summarization (2 lectures)
    1. Automatic Text Summarization
    2. A Survey of Text Summarization Techniques, Ani Nenkova and Kathleen McKeown, Mining Text Data, 2012 - Springer


  10. 14 Jan 16 Topic Modeling and Latent Dirichlet Allocation
    1. David Blei's Lecture on LDA Sept 2009, Part 1 (1h30) and Part 2 (1h30)
    2. Slides of Blei's lecture


  11. 17 Jan 16 Deep Learning for NLP
    1. A Primer on Neural Network Models for Natural Language by Yoav Goldberg, Oct 2015
    2. Natural Language Understanding with Distributed Representation by Kyunghyun Cho, Nov 2015
    3. Deep Learning by Goodfellow, Bengio and Courville, 2016
    4. The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy, May 2015 and the analysis of the same data The unreasonable effectiveness of Character-level Language Models (and why RNNs are still cool) by Yoav Goldberg, June 2015
    5. Calculus on Computational Graphs: Backpropagation, by Chris Olah, Aug 2015.
    6. Understanding LSTM Networks, by Chris Olah, Aug 2015.
    7. WildML articles by Denny Britz - Sep 2015 - Jan 2016

      These include tutorials and Python notebooks of incremental complexity covering topics in Deep learning for NLP.

      1. Implementing a Neural Network from Scratch in Python - an Introduction
      2. Speeding up your Neural Network with Theano and the GPU
      3. Recurrent Neural Networks Tutorial, Part 1 - Introduction to RNNs
      4. Recurrent Neural Networks Tutorial, Part 2 - Implementing a RNN with Python, Numpy and Theano
      5. Recurrent Neural Networks Tutorial, Part 3 - Backpropagation Through Time and Vanishing Gradients
      6. Recurrent Neural Network Tutorial, Part 4 - Implementing a GRU/LSTM RNN with Python and Theano
      7. Understanding Convolutional Neural Networks for NLP
      8. Implementing a CNN for Text Classification in TensorFlow
      9. Attention and Memory in Deep Learning and NLP

Software
  • NLTK: a Python-based toolkit with wide coverage of NLP techniques - both statistical and knowledge-based.
  • SISC Scheme Interpreter: we use Scheme examples to demonstrate algorithms in parsing, generation and some semantic analysis. This interpreter is very small and convenient to use on any platform supporting Java (full version is 2.4MB with full doc - jar is 300KB).
  • Theano - a Python library for Deep Learning.
  • Torch - a Lua library for Deep Learning.
  • TensorFlow - a Python library for Deep Learning.

Resources

Last modified 21 Feb 2016 Michael Elhadad