Michael Elhadad

Natural Language Processing (202-2-5211) Fall 2013

Meets:
Sun 12-14 Bldg 34 Room 205
Thu 10-12 Bldg 34 Room 005

News:

  1. 03 Nov 12: HW1 is available.
  2. 16 Nov 12: Due to the security situation, HW1 submission is postponed by one week to Sun 25 Nov. Please send questions by email as needed.
  3. 29 Nov 12: HW2 is available.
  4. 13 Dec 12: Submission date of HW2 is extended to Dec 23rd.
  5. 23 Dec 12: Please request slots for the in-person (frontal) checking of HW1 and HW2: slots.
  6. 08 Jan 13: HW3 is available.
  7. 16 Jan 13: Added links with references to evalb metric in HW3.
  8. 16 Jan 13: There will be NO classes on Thursday 17 Jan and Sunday 20 Jan.

Objectives

The course is an introduction to Natural Language Processing. Its main objective is to learn how to develop practical computer systems capable of performing intelligent tasks on natural language: analyzing, understanding and generating written text. This requires material from several fields: linguistics, machine learning and statistical analysis, and core natural language processing techniques.
  1. Acquire basic understanding of linguistic concepts and natural language complexity: variability (the possibility to express the same meaning in many different ways) and ambiguity (the fact that a single expression can refer to many different meanings in different contexts); levels of linguistic description (word, sentence, text; morphology, syntax, semantics, pragmatics); schools of linguistic analysis (functional, distributional, Chomskyan); empirical methods in linguistics; lexical semantics; syntactic description; natural language semantics issues.
  2. Acquire basic understanding of machine learning techniques as applied to language: supervised vs. unsupervised methods; training vs. testing; classification; regression; distributions, KL-divergence; Bayesian methods; Support Vector Machines; Perceptron.
  3. Acquire basic understanding of natural language processing techniques: word and sentence tokenization; parts of speech tagging; lemmatization and morphological analysis; chunking; named entity recognition; n-gram language models; probabilistic context free grammars; probabilistic dependency grammars; parsing accuracy metrics; Treebank analysis; text simplification; paraphrase detection; summarization; text generation.
Topics covered in class include:
  1. Descriptive linguistic models
  2. Language Models and n-grams -- Statistical Models of Unseen Data (Smoothing)
  3. Parts of speech tagging, morphology
  4. Information Extraction / Named Entity Recognition
  5. Using Machine Learning Tools: Classification, Sequence Labeling / Supervised Methods / SVM and CRF
  6. Bayesian Statistics, generative models, topic models, LDA
  7. Syntactic descriptions: parsing sentences (why and how), PCFGs, Dependency Parsing
  8. Compositional Semantics from CFG Parsing
  9. Text Summarization
  10. Sentence Simplification
  11. Text Generation

Lecture Notes
  1. General Intro to NLP - Linguistic Concepts
  2. Peter Norvig: How to Write a Spelling Corrector (2007) - toy spelling corrector illustrating the statistical NLP method (probability theory, dealing with large collections of text, learning language models, evaluation methods).
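     The gist of the approach can be sketched in a few lines (this is a simplified sketch, not Norvig's actual code; 'big.txt' stands for any large plain-text corpus you have at hand):

         # Noisy-channel spelling sketch: rank candidate corrections
         # within edit distance 1 by their corpus frequency.
         import re
         from collections import Counter

         WORDS = Counter(re.findall(r'[a-z]+', open('big.txt').read().lower()))

         def edits1(word):
             """All strings one edit away (delete, transpose, replace, insert)."""
             letters = 'abcdefghijklmnopqrstuvwxyz'
             splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
             deletes = [a + b[1:] for a, b in splits if b]
             transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
             replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
             inserts = [a + c + b for a, b in splits for c in letters]
             return set(deletes + transposes + replaces + inserts)

         def correct(word):
             """Pick the known candidate with the highest unigram count."""
             candidates = ({word} & set(WORDS)) or (edits1(word) & set(WORDS)) or {word}
             return max(candidates, key=WORDS.get)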
  3. Parts of speech Tagging

    Things to do:

    1. Learn Python.
    2. Install a good Python environment: see the notes on Python's ecosystem.
    3. Experiment with parts of speech tagging of English text online: NLTK APIs demo.
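      The same taggers can be run locally once NLTK and its tagger models are installed (fetch the models with nltk.download()); this is the classic example from the NLTK book:

         >>> import nltk
         >>> nltk.pos_tag(nltk.word_tokenize("They refuse to permit us to obtain the refuse permit"))
         [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'),
          ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]

      Note how the two occurrences of "refuse" and "permit" receive different tags depending on context.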
    4. Install NLTK: see the notes for Python 2.7.

      Q: How do you find out where your package is installed after you use easy_install?

      A: in the Python shell, type: import nltk; then type: nltk. You will get an answer like:

      >>> import nltk
      >>> nltk
      <module 'nltk' from 'C:\python27\lib\site-packages\nltk\__init__.py'>
      >>>
         
    5. Explore the Brown corpus of parts-of-speech tagged English text using NLTK's corpus reader and FreqDist object:
      • What are the 10 most common words in the corpus?
      • What are the 5 most common tags in the corpus?
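      One possible starting point (a sketch assuming NLTK 3, where FreqDist supports most_common; in NLTK 2, fd.keys() is already sorted by decreasing frequency, so fd.keys()[:10] works instead):

         from nltk import FreqDist
         from nltk.corpus import brown

         # Frequency distributions over words and over tags in Brown.
         word_freq = FreqDist(w.lower() for w in brown.words())
         tag_freq = FreqDist(tag for (word, tag) in brown.tagged_words())

         print(word_freq.most_common(10))   # 10 most common words
         print(tag_freq.most_common(5))     # 5 most common tags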
    6. Read Chapter 5 of the NLTK book
    7. Advanced topics in POS tagging: we will return to POS tagging in the following chapters, with more advanced sequence labeling methods (HMM), feature-based classifier methods for tagging (CRF), and as a test case for unsupervised EM and Bayesian techniques. You can look at the source code of the nltk.tag module to get a feeling for how the tag.hmm, tag.crf and tag.tnt methods are implemented.
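
      As a first taste, here is a sketch of training NLTK's HMM tagger on Brown (the category, split sizes and smoothing constant are arbitrary choices for illustration):

         from nltk.corpus import brown
         from nltk.probability import LidstoneProbDist
         from nltk.tag.hmm import HiddenMarkovModelTrainer

         sents = brown.tagged_sents(categories='news')
         train, test = sents[:3000], sents[3000:3100]

         # Smoothed estimator: plain MLE assigns zero probability to unseen events.
         est = lambda fd, bins: LidstoneProbDist(fd, 0.1, bins)
         tagger = HiddenMarkovModelTrainer().train_supervised(train, estimator=est)
         print(tagger.evaluate(test))   # per-token tagging accuracy on held-out data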

      The following papers give a good feeling of the current state of the art in POS tagging:

  4. Basic Statistic Concepts

    Things to do to check you are following:

    1. Watch the 15-minute video (ML 7.1) Bayesian inference - A simple example by Mathematical Monk.
    2. Write a Python generator that generates all 2^n subsets of a set of n elements.
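      One possible solution treats the bits of a counter running from 0 to 2^n - 1 as membership flags (an itertools.combinations-based version works just as well):

         def subsets(elements):
             """Generate all 2**n subsets of a list of n elements."""
             for mask in range(2 ** len(elements)):
                 yield [x for i, x in enumerate(elements) if mask & (1 << i)]

         # list(subsets(['a', 'b'])) -> [[], ['a'], ['b'], ['a', 'b']]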
    3. Make sure you have installed numpy and scipy in your Python environment. The easiest way is to download the scipy setup from scipy at SourceForge -- choose the precompiled binaries called "superpacks".
    4. Write Python code using numpy, scipy and matplotlib.pyplot to draw the graphs of the Beta distribution that appear in the lecture notes.
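      A sketch of one way to do it with scipy.stats.beta (the (a, b) pairs here are illustrative, not necessarily the ones in the notes):

         import numpy as np
         import matplotlib.pyplot as plt
         from scipy.stats import beta

         x = np.linspace(0.01, 0.99, 99)   # avoid the endpoints, where the pdf can diverge
         for a, b in [(0.5, 0.5), (1, 1), (2, 3), (8, 4)]:
             plt.plot(x, beta.pdf(x, a, b), label="a=%s, b=%s" % (a, b))
         plt.xlabel(r"$\mu$")
         plt.legend()
         plt.show()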
    5. Given a dataset for a Bernoulli distribution (that is, a list of N bits), generate a sequence of N graphs illustrating the sequential update process, starting from a uniform prior and ending with the Nth posterior distribution. Each graph shows the distribution over μ, the parameter of the Bernoulli distribution (which takes values in the [0..1] range).
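      A sketch of the update loop, relying on Beta-Bernoulli conjugacy (observing bit x turns Beta(a, b) into Beta(a + x, b + 1 - x)); it overlays the successive posteriors on a single plot for compactness, but saving one figure per step works the same way:

         import numpy as np
         import matplotlib.pyplot as plt
         from scipy.stats import beta

         data = [1, 0, 1, 1, 0, 1]            # toy Bernoulli dataset
         x = np.linspace(0.01, 0.99, 99)
         a, b = 1.0, 1.0                      # Beta(1, 1) is the uniform prior

         plt.plot(x, beta.pdf(x, a, b), label="prior")
         for i, bit in enumerate(data):
             a, b = a + bit, b + (1 - bit)    # conjugate update for one observation
             plt.plot(x, beta.pdf(x, a, b), label="after %d obs" % (i + 1))
         plt.xlabel(r"$\mu$")
         plt.legend()
         plt.show()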
    6. Learn how to draw Dirichlet samples using numpy.random.mtrand.dirichlet. A sample from a Dirichlet distribution is a probability vector, that is, the parameter vector of a multinomial distribution. Understand the example from the Wikipedia article on Dirichlet distributions about string cutting:
         import numpy as np
         import matplotlib.pyplot as plt

         # Draw 20 samples from a Dirichlet(10, 5, 3); each sample is a
         # 3-component probability vector -- the relative lengths of the
         # three pieces each unit-length string is cut into.
         s = np.random.dirichlet((10, 5, 3), 20).transpose()

         # Plot each sample as a stacked horizontal bar of its three pieces.
         plt.barh(range(20), s[0])
         plt.barh(range(20), s[1], left=s[0], color='g')
         plt.barh(range(20), s[2], left=s[0]+s[1], color='r')
         plt.title("Lengths of Strings")
         plt.show()
         
         

  5. Classification

    1. Read Chapter 6: Learning to Classify Text of the NLTK Book.
    2. Read Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression, Tom Mitchell, 2010.
    3. Watch (ML 8.1) Naive Bayes Classification, a 15-minute video on Naive Bayes classification by Mathematical Monk, and the following segment (ML 8.3) about Bayesian Naive Bayes.
    Practical work:
    1. Explore the documentation of the nltk.classify module.
    2. Read the code of the NLTK Naive Bayes classifier and run nltk.classify.naivebayes.demo()
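      The classifier interface itself is small; here is a minimal hand-rolled example (the features and labels are toy data invented for illustration):

         import nltk

         # Toy training set: (feature dict, label) pairs.
         train = [({'last_letter': 'a'}, 'female'),
                  ({'last_letter': 'k'}, 'male'),
                  ({'last_letter': 'a'}, 'female'),
                  ({'last_letter': 'o'}, 'male')]

         classifier = nltk.NaiveBayesClassifier.train(train)
         print(classifier.classify({'last_letter': 'a'}))   # -> 'female'
         classifier.show_most_informative_features()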
    3. Read the code of the NLTK classifier demos: names_demo and wsd_demo.
    4. Install the scikit-learn library (under Windows, the easiest installation method is to use the Windows Installer).
    5. Read the documentation on feature extraction in Scikit-learn.
    6. Run the example on document classification in Scikit-learn.
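      The heart of that example is a vectorizer feeding a linear classifier; here is a stripped-down sketch with an invented four-document corpus:

         from sklearn.feature_extraction.text import CountVectorizer
         from sklearn.naive_bayes import MultinomialNB

         docs = ["the match was a great game",
                 "the election results were announced",
                 "he scored a goal in the game",
                 "the candidate won the election"]
         labels = ["sports", "politics", "sports", "politics"]

         vectorizer = CountVectorizer()
         X = vectorizer.fit_transform(docs)   # documents -> sparse term-count matrix
         clf = MultinomialNB().fit(X, labels)

         print(clf.predict(vectorizer.transform(["a great game"])))  # -> ['sports']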
    7. Read the source code of the NLTK Scikit-learn integration.

  6. Parsing

    1. Context Free Grammars Parsing
    2. Probabilistic Context Free Grammars Parsing
    3. Notes on computing KL-divergence
    4. NLTK tools for PCFG parsing
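
    For a first experiment, here is a toy PCFG and the Viterbi parser (a sketch assuming NLTK 3; in NLTK 2 the grammar is built with nltk.parse_pcfg, and parse() returns a single tree rather than an iterator):

       import nltk

       # Rule probabilities for each left-hand side must sum to 1.
       grammar = nltk.PCFG.fromstring("""
           S   -> NP VP   [1.0]
           NP  -> 'I'     [0.4] | Det N [0.6]
           VP  -> V NP    [1.0]
           Det -> 'the'   [1.0]
           N   -> 'dog'   [1.0]
           V   -> 'saw'   [1.0]
       """)

       parser = nltk.ViterbiParser(grammar)
       for tree in parser.parse(['I', 'saw', 'the', 'dog']):
           print(tree)   # most probable parse, annotated with its probability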

  7. Automatic Text Summarization

  8. Semantic Analysis

    1. First Order Logic
    2. Compositional Semantic Analysis with Definite Clause Grammars
    3. More on DCGs
    4. Some complex syntactic constructs in English and their DCG semantic interpretation
    5. Building a Question-Answering system using semantic interpretation

Software
Resources

Last modified Jan 08th, 2013 Michael Elhadad