Natural Language Processing (202-2-5211) Fall 2015

Meets:
Sun 12-14 Bdg 72 Room 213
Thu 10-12 Bdg 34 Room 003

News:

15 Jan 15: Don't forget to attend the NLP Bohan on Sunday 18 Jan 12-14
29 Dec 14: HW3 is available.
28 Dec 14: Registration for HW2 Grading is open: send me email with the slot you would like to reserve.
18 Dec 14: No class on Thursday 18 December
15 Dec 14: HW2 deadline extented to Sun Dec 21st midnight
07 Dec 14: In HW2 Q2.2, you should use the method conll2002.iob_sents() to access the data tagged with Begin, Inside, Outside tags for the chunks.
01 Dec 14: Registration for HW1 Grading is open: send me email with the slot you would like to reserve.
01 Dec 14: HW2 is available.
13 Nov 14: In HW1, you should use the 'universal tagset' (introduced in NLTK 3.0 instead of the "simplified tagset") discussed in the lecture notes.
```
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
```
The Universal Part-of-Speech Tagset is described in this table:
- ADJ (adjective) new, good, high, special, big, local
- ADP (adposition) on, of, at, with, by, into, under
- ADV (adverb) really, already, still, early, now
- CONJ (conjunction) and, or, but, if, while, although
- DET (determiner, article) the, a, some, most, every, no, which
- NOUN (noun) year, home, costs, time, Africa
- NUM (numeral) twenty-four, fourth, 1991, 14:24
- PRT (particle) at, on, out, over per, that, up, with
- PRON (pronoun) he, their, her, its, my, I, us
- VERB (verb) is, say, told, given, playing, would
- . (punctuation marks) . , ; !
- X (other) ersatz, esprit, dunno, gr8, univeristy
12 Nov 14: Added an example of iPython Notebook showing a plot to get you started.
11 Nov 14: Updated google.py for HW1 to make it compatible with nltk 3.0.
11 Nov 14: Added an example of iPython Notebook to get you started.
10 Nov 14: HW1 is available.
26 Oct 14: Welcome to NLP 15!

Objectives

The course is an introduction to Natural Language Processing. The main objective of the course is to learn how to develop practical computer systems capable of performing intelligent tasks on natural language: analyze, understand and generate written text. This task requires learning material from several fields: linguistics, machine learning and statistical analysis, and core natural language techniques.

Acquire basic understanding of linguistic concepts and natural language complexity: variability (the possibility to express the same meaning in many different ways) and ambiguity (the fact that a single expression can refer to many different meanings in different contexts); levels of linguistic description (word, sentence, text; morphology, syntax, semantics, pragmatics). Schools of linguistic analysis (functional, distributional, Chomskyan); Empirical methods in Linguistics; Lexical semantics; Syntactic description; Natural language semantics issues.
Acquire basic understanding of machine learning techniques as applied to language: supervised vs. unsupervised methods; training vs. testing; classification; regression; distributions, KL-divergence; Bayesian methods; Support Vector Machines; Perceptron; Deep Learning methods in NLP.
Natural language processing techniques: word and sentence tokenization; parts of speech tagging; lemmatization and morphological analysis; chunking; named entity recognition; n-gram language models; probabilistic context free grammars; probabilistic dependency grammars; parsing accuracy metrics; Treebank analysis; Text simplification; Paraphrase detection; Summarization; Text generation.

Topics covered in class include:

Descriptive linguistic models
Language Models and n-grams -- Statistical Models of Unseen Data (Smoothing)
Parts of speech tagging, morphology
Information Extraction / Named Entity Recognition
Using Machine Learning Tools: Classification, Sequence Labeling / Supervised Methods / SVM. CRF, Perceptron
Bayesian Statistics, generative models, topic models, LDA
Syntactic descriptions: Parsing sentence, why, how, PCFGs, Dependency Parsing
Compositional Semantic from CFG Parsing
Text Summarization
Sentence Simplification
Text Generation

Lecture Notes

26 Oct 14: General Intro to NLP - Linguistic Concepts
Things to do:
1. Experiment with Google Translate: find ways to make Google Translate "fail dramatically" (generate very wrong translations). Explain your method and collect your observations. Document attempts you made that did NOT make Google Translate fail.
2. Think of reasons why natural languages have evolved to become ambiguous.

30 Oct 14: Peter Norvig: How to Write a Spelling Corrector (2007) - toy spelling corrector illustrating the statistical NLP method (probability theory, dealing with large collections of text, learning language models, evaluation methods).
Things to do:
1. Read about Probability axioms.
2. Read about Edit Distance and in more details, a review of minimum edit distance algorithms using dynamic programming from Dan Jurafsky.
3. Install Python: I recommend installing the Anaconda distribution (choose the Python 2.7 version).
  The Anaconda distribution includes a large set of Python packages ready to use that we will find useful. (367MB download, 2GB disk space needed.) In particular, Anaconda includes the nltk package and the numpy and scipy packages.
4. Execute Norvig's spell checker for English (you will neeed spell.py and the large file of text used for training big.txt).
5. How many tokens are there in big.txt? How many distinct tokens? What are the 10 most frequent words in big.txt?
6. In a very large corpus (discussed in the ngram piece quoted below), the following data is reported:
  The 10 most common types cover almost 1/3 of the tokens, the top 1,000 cover just over 2/3.
  What do you observe on the much smaller big.txt corpus?
7. You can read more from Norvig's piece on ngrams.
8. Execute the word segmentation example from Norvig's ngram chapter (code in ngrams.py).
  Note the very useful definition of the @memo decorator in this example, which is an excellent method to implement dynamic programming algorithms in Python. From Python Syntax and Semantics:
  A Python decorator is any callable Python object that is used to modify a function, method or class definition. A decorator is passed the original object being defined and returns a modified object, which is then bound to the name in the definition. Python decorators were inspired in part by Java annotations, and have a similar syntax; the decorator syntax is pure syntactic sugar, using @ as the keyword:
```
@viking_chorus
def menu_item():
    print("spam")
	
```
  is equivalent to:
```
def menu_item():
    print("spam")
menu_item = viking_chorus(menu_item)
	
```
9. This corpus includes a list of about 40,000 pairs of words (error, correction). It is too small to train a direct spell checker that would map word to word. Propose a way to learn a useful error model (better than the one used in Norvig's code) using this corpus. Hint: look at the model of weighted edit distance presented in Jurafsky's lecture cited above.

02-09 Nov 2014: Parts of speech Tagging (3 lectures)
Things to do:
1. Learn Python:
  - Python Tutorial (Use Python 2.7)
  - Google intro to Python
2. Install a good Python environment: The default environment is pyCharm (the free Community Edition is good for our needs 127MB download).
  Hackers may want to get deeper in mastering Python's environment: Python's ecosystem.
3. Experiment with parts of speech tagging of English text online: NLTK APIs demo
4. Install NLTK: if you have installed Anaconda, it is already installed. Make sure to download the corpora included with nltk: Notes for 2.7.
  Q: How do you find out where your package is installed after you use easy_install?
  A: in the Python shell, type: import nltk; then type: nltk. You will get an answer like:
```
>>> import nltk
>>> nltk
<module 'nltk' from 'C:\python27\lib\site-packages\nltk\__init__.py'>
>>>
   
```
5. Explore the Brown corpus of parts-of-speech tagged English text using NLTK's corpus reader and FreqDist object:
  - What are the 10 most common words in the corpus?
  - What are the 5 most common tags in the corpus?
6. Read Chapter 5 of the NLTK book
7. Advanced topics in POS tagging: we will get back to the task of POS tagging with different methods in the following chapters, for more advanced sequenced labeling methods (HMM), feature-based classifier methods for tagging (CRF), and as a test case for unsupervised EM techniques and Bayesian techniques. You can look at the source code of the nltk.tag module for a feeling of how the tag.hmm, tag.crf and tag.tnt methods are implemented.
  The following papers give a good feeling of the current state of the art in POS tagging:
  - A Universal Part-of-Speech Tagset by Slav Petrov, Dipanjan Das and Ryan McDonald, LREC, 2012.
  - Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections by Dipanjan Das and Slav Petrov, ACL, 2011.
  - See this accompanying slides presentation.
8. Read A good POS tagger in 200 lines of Python, an Averaged Perceptron implementation with good features, fast, reaches 97% accuracy (by Matthew Honnibal).

13-20 Nov 2014: Basic Statistic Concepts (3 lectures)
Things to do:
1. Watch the 15mn video (ML 7.1) Bayesian inference - A simple example by Mathematical Monk.
2. Write a Python generator that generates all 2ⁿ subsets given a set of n elements.
3. Make sure you have installed numpy and scipy in your Python environment. Easiest way is to use the Anaconda distribution. Else, download the setup for scipy in scipy at sourceforge -- choose the precompiled binaries called "superpacks".
4. Write Python code using numpy, scipy and matplotlib.pyplot to draw the graphs of the Beta distribution that appear in the lecture notes.
5. Given a dataset for a Bernouilli distribution (that is, a list of N bits), generate a sequence of N graphs illustrating the sequential update process, starting from a uniform prior until the Nth posterior distribution. Each graph indicates the distribution over μ, the parameter of the Bernouilli distribution (which takes value in the [0..1] range).
6. Learn how to draw Dirichlet samples using numpy.random.mtrand.dirichlet. A sample from a Dirichlet distribution is a multinomial distribution. Understand the example from the Wikipedia article on Dirichlet distributions about string cutting:
```
   import numpy as np
   import matplotlib.pyplot as plt
   s = np.random.dirichlet((10, 5, 3), 20).transpose()
   plt.barh(range(20), s[0])
   plt.barh(range(20), s[1], left=s[0], color='g')
   plt.barh(range(20), s[2], left=s[0]+s[1], color='r')
   plt.title("Lengths of Strings")
   plt.show()
   
```
7. Compute the MLE estimator μ_MLE of a binomial distribution Bin(m|N, μ).
8. Mixture Priors: assume we hesitate about two possible modes for the value of our Beta-Binomial model parameter μ. A flexible method to encode this belief is to consider that our prior over the value of μ has the form:
```
   μ ~ k₁Beta(a, b) + k₂Beta(c, d)
   where k₁ + k₂ = 1
   m ~ Bin(μ N)
   
```
  A prior over μ of this form is called a mixture prior - as it is a linear combination of simple priors.
  1. Prove that the mixture prior is a proper distribution.
  2. Compute the posterior density over μ for a dataset where (N = 10, m=8, N-m=2) where k₁=0.8 and k₂=0.2 and the prior distributions are Beta(1,10) and Beta(10,1). Write Python code to draw the prior density of μ and its posterior density.
23-30 Nov 14 Classification (3 lectures)
1. Read Chapter 6: Learning to Classify Text of the NLTK Book.
2. Read Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression, Tom Mitchell, 2010.
3. Watch (ML 8.1) Naive Bayes Classification a 15mn on Naive Bayes Classification by Mathematical Monk and the following chapter (ML 8.3) about Bayesian Naive Bayes.
Practical work:
1. Explore the documentation of the nltk.classify module.
2. Read the code of the NLTK Naive Bayes classifier and run nltk.classify.naivebayes.demo()
3. Read the code of the NLTK classifier demos: names_demo and wsd_demo.
4. Install the scikit-learn library (under Windows, the easiest installation method is to use the Windows Installer.
5. Read the documentation on feature extraction in Scikit-learn.
6. Run the example on document classification in Scikit-learn.
7. Read the source code of the NLTK Scikit-learn integration.
8. Experiment with the example of classifications in this iPython notebook which shows how to run NLTK classifiers in a variety of ways.

04 Dec 14 Sequence Classification (1 lecture)
1. Read Michael Collin's notes on Language Modeling: Markov models for fixed length sequences, for variable length sequences, trigram language models, MLE estimates, perplexity over n-gram models, smoothing of n-gram estimates with linear interpolation.
2. Read Michael Collin's nodes on Tagging Problems, and Hidden Markov Models: POS tagging and Named Entity Recognition as tagging problems (with BIO tag encoding), generative and noisy channel models, generative tagging models, trigram HMM, conditional independence assumptions in HMMs, estimating the parameters of an HMM, decoding HMMs with the Viterbi algorithm.

07 Dec 14 Parsing (5 lectures)

01 Jan 15 Dependency Parsing (1 lecture)
1. Dependency Parsing by Graham Neubig.
2. Dependency Parsing: Past, Present, and Future, Chen, Li and Zhang, 2014 (Coling 2014 tutorial)
3. NLTK Dependency Parsing Howto
4. Parsing English with 500 lines of Python, an implementation by Matthew Honnibal of Training Deterministic Parsers with Non-Deterministic Oracles, Yoav Goldberg and Joakim Nivre, TACL 2013. (Complete Python code)
5. Neural Network Dependency Parser, Chen and Manning 2014. A Java implementation of a Neural Network Dependency Parser with Unlabelled accuracy of 92% and Labelled accuracy of 90%.

04 Jan 15 - 11 Jan Automatic Text Summarization (3 lectures)

Semantic Analysis
1. First Order Logic
2. Analyzing the Meaning of Sentences Chapter 10 of the NLTK book
3. NLTK Semantics HowTo: demonstrates NLTK classes usage for manipulating first-order models.
4. Compositional Semantic Analysis with Declarative Clause Grammars
5. More on DCGs
6. Some complex syntactic constructs in English and their DCG semantic interpretation
7. Building a Question-Answering system using semantic interpretation

Software

NLTK: Nltk is a Python based toolkit with wide coverage of NLP techniques - both statistical and knowledge-based.
How to install nltk with Python 2.7.
SISC Scheme Interpreter: we use Scheme examples to demonstrate algorithms in parsing, generation and some semantic analysis. This interpreter is very small and convenient to use on any platform supporting Java (full version is 2.4MB with full doc - jar is 300KB).
NLTK APIs demo: online tests of some of NLTK tools.

Resources

Online video courses:
- Statistical Methods for NLP, by Joakim Nivre, Uppsala, 2012
- Machine Learning by Andrew Ng, Stanford 2007
- Natural Language Processing by Chris Manning, Stanford
Online Courses
Tutorials
- A Statistical MT Tutorial Workbook, Kevin Knight, 1999
- Bayesian Inference with Tears, Kevin Knight, 2009
- Structured Prediction for Natural Language Processing, Noah Smith, ICML 2009
- Gibbs Sampling for the Uninitiated, Philip Resnik and Eric Hardisty, June 2010
- A Tutorial on Dual Decomposition and Lagrangian Relaxation for Inference in Natural Language Processing, Alexander M. Rush and Michael Collins, 2011
- Classification for NLP, Dan Klein, ACL 2007
- Structured Bayesian Nonparametric Models with Variational Inference, Dan Klein and Percy Liang, ACL 2007
Books:
- Introduction to Natural Language Processing, bySteven Bird, Ewan Klein and Edward Loper, 2009, distributed on the NLTK site.
- Pattern Recognition and Machine Learning, Chris Bishop, 2007.
- Information Theory, Inference, and Learning Algorithms, David J.C. MacKay, 2003.
- The LingPipe Java library suite - by Bob Carpenter. Information extraction and data mining tools.
- Text Analysis with LingPipe, Bob Carpenter, 2011. A practical book on using Lingpipe. Covers strings, streams, regular expressions, corpora readers, tokenization, language models, classifiers and Latent Dirichlet Allocation.
Python
1. Google's intro to Python
2. Python Ecosystem: notes on installing Python in your environment
3. Introduction to Python Generators
4. Scikit Learn: Machine Learning in Python --
Extracting text from HTML:
- The JusText Python package does a very good job of cleaning up HTML by removing boilerplate HTML code around "interesting text".
- Decruft is a Python implementation of the Readability algorithm - easy to use, works well.
- Readability browser bookmarklet Readability is a browser bookmarklet that wipes out all that junk so you can have a more enjoyable reading experience. It works with all the latest browsers and its success rate is pretty respectable (we'd guess over 90% of web sites are handled properly). It is implemented in Jscript and relies on comparing the HTML link density of HTML elements: clean text have less density than junk.
- CleanEval homepage: CLEANEVAL is a shared task and competitive evaluation on the topic of cleaning arbitrary web pages, with the goal of preparing web data for use as a corpus, for linguistic and language technology research and development. (2007)
- Python code to clean HTML pages -- based on the idea that elements in HTML with "clean" text have less HTML link density than "useless" elements. (2008)
Bob Carpenter's Computational Linguistics Syllabus - Excellent collection of references in the field.
The NLTK Toolkit - Python toolkit with thorough tutorial on all Natural Language topics and access to online datasets.
Corpus Linguistics, Resources and Normalisation, Sylvain Pogodalla 2008

Last modified Jan 15, 2015 Michael Elhadad