Michael Elhadad

Natural Language Processing (202-2-5211)

Meets:
Sun 10-12 Bdg 34 Room 103
Thu 10-12 Bdg 97 Room 203

News:

  1. 20 Apr 12: HW1 is available
  2. 29 Apr 12: Google.py: An updated Python function that retrieves Google results. The xgoogle package referenced in the assignment stopped working with the latest version of Google. Tal Baumel submitted this function, which seems to work fine now.
  3. 10 May 12: Please request a slot for the in-person checking of HW1: slots.
  4. 22 May 12: HW2 is available
  5. 14 Jun 12: HW3 is available
  6. 21 Jun 12: Please request a slot for the in-person checking of HW2: slots.
  7. 25 Jul 12: Notes on Wikification in Hebrew: how to use JWPL for the Hebrew Wikipedia Dump.
  8. 26 Jul 12: Notes on morphological analysis in Hebrew: how to use Meni Adler's morphological disambiguation on Hebrew text.

Objectives

The course is an introduction to Natural Language Processing. Its main objective is to learn how to develop practical computer systems capable of performing intelligent tasks on natural language: analyzing, understanding and generating written text. This requires material from several fields: linguistics, machine learning and statistical analysis, and core natural language techniques.
  1. Acquire a basic understanding of linguistic concepts and natural language complexity: variability (the possibility of expressing the same meaning in many different ways) and ambiguity (the fact that a single expression can have different meanings in different contexts); levels of linguistic description (word, sentence, text; morphology, syntax, semantics, pragmatics); schools of linguistic analysis (functional, distributional, Chomskyan); empirical methods in linguistics; lexical semantics; syntactic description; natural language semantics issues.
  2. Acquire a basic understanding of machine learning techniques as applied to language: supervised vs. unsupervised methods; training vs. testing; classification; regression; distributions, KL-divergence; Bayesian methods; Support Vector Machines; Perceptron.
  3. Learn core natural language processing techniques: word and sentence tokenization; parts of speech tagging; lemmatization and morphological analysis; chunking; named entity recognition; n-gram language models; probabilistic context free grammars; probabilistic dependency grammars; parsing accuracy metrics; treebank analysis; text simplification; paraphrase detection; summarization; text generation.
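
As a small illustration of the KL-divergence listed in objective 2, here is a minimal sketch that assumes two discrete distributions given as Python dicts mapping outcomes to probabilities. The function name `kl_divergence` and the toy tag distributions are made up for illustration and are not taken from the course notes.

```python
import math

def kl_divergence(p, q):
    """Return D_KL(P || Q) in bits for discrete distributions given as dicts.

    Assumes the values of p and of q each sum to 1.
    Outcomes with p[x] == 0 contribute nothing to the sum.
    If q[x] == 0 while p[x] > 0, the divergence is infinite.
    """
    total = 0.0
    for outcome, p_x in p.items():
        if p_x == 0:
            continue
        q_x = q.get(outcome, 0.0)
        if q_x == 0:
            return math.inf
        total += p_x * math.log2(p_x / q_x)
    return total

# Toy distributions over three tags (made-up numbers).
p = {"NN": 0.5, "VB": 0.25, "JJ": 0.25}
q = {"NN": 0.25, "VB": 0.25, "JJ": 0.5}

print(kl_divergence(p, q))
print(kl_divergence(p, p))
```

The early return for a zero in `q` is one reason smoothed estimates are usually preferred before comparing distributions estimated from text.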

Lecture Notes
  1. General Intro to NLP - Linguistic Concepts
  2. Peter Norvig: How to Write a Spelling Corrector (2007) - toy spelling corrector illustrating the statistical NLP method (probability theory, dealing with large collections of text, learning language models, evaluation methods).
  3. Parts of speech Tagging

    Things to do this week:

    1. Learn Python.
    2. Install a good Python environment: Python's ecosystem.
    3. Experiment with parts of speech tagging of English text online: NLTK APIs demo
    4. Install NLTK: Notes for 2.7
    5. Explore the Brown corpus of parts-of-speech tagged English text using NLTK's corpus reader and FreqDist object:
      • What are the 10 most common words in the corpus?
      • What are the 5 most common tags in the corpus?
    6. Read Chapter 5 of the NLTK book

  4. Basic Statistical Concepts
  5. Context Free Grammars Parsing
  6. Probabilistic Context Free Grammars Parsing
  7. Notes on computing KL-divergence
  8. NLTK tools for PCFG parsing
  9. Automatic Text Summarization
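
For the Brown corpus exercise in the to-do list under Parts of speech Tagging, the two counting questions could be approached along the following lines. This is only a sketch: the helper names are hypothetical, the counting is done with `collections.Counter` on a tiny hand-made sample so that it runs without any downloads, and lowercasing the words is a design choice rather than part of the exercise. The `brown_counts` function assumes NLTK 3 or later with the Brown corpus data installed (for example via `nltk.download('brown')`), which may differ from the installation the course notes describe.

```python
from collections import Counter

def most_common_words(words, n=10):
    """Return the n most frequent words, ignoring case."""
    return Counter(w.lower() for w in words).most_common(n)

def most_common_tags(tagged_words, n=5):
    """Return the n most frequent tags from (word, tag) pairs."""
    return Counter(tag for _, tag in tagged_words).most_common(n)

def brown_counts():
    """Answer both questions on the real corpus.

    Assumption: NLTK 3+ and its Brown corpus data are installed.
    This function is defined here but not called.
    """
    import nltk
    from nltk.corpus import brown
    word_freq = nltk.FreqDist(w.lower() for w in brown.words())
    tag_freq = nltk.FreqDist(tag for _, tag in brown.tagged_words())
    return word_freq.most_common(10), tag_freq.most_common(5)

# Tiny made-up sample in the (word, tag) format that brown.tagged_words() yields.
sample = [("The", "AT"), ("dog", "NN"), ("saw", "VBD"),
          ("the", "AT"), ("cat", "NN"), (".", ".")]

print(most_common_words((w for w, _ in sample), n=2))
print(most_common_tags(sample, n=2))
```

Calling `brown_counts()` in an environment with the corpus installed returns the two ranked lists that the exercise asks for.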

Topics covered in class will include:
  1. Descriptive linguistic models
  2. Language Models and n-grams -- Statistical Models of Unseen Data (Smoothing)
  3. Parts of speech tagging, morphology
  4. Information Extraction / Named Entity Recognition
  5. Syntactic descriptions: parsing sentences (why and how), PCFGs, Dependency Parsing
  6. Using Machine Learning Tools: Classification, Sequence Labeling / Supervised Methods / SVM and CRF
  7. Bayesian Statistics, generative models, topic models, LDA
  8. Compositional Semantics from CFG Parsing
  9. Text Summarization
  10. Sentence Simplification
  11. Text Generation
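
To make topic 2 more concrete, here is a minimal sketch of a bigram language model with add-one (Laplace) smoothing, one simple way to give non-zero probability to word pairs never seen in training. The class name `BigramModel`, the toy corpus, and the choice of add-one smoothing are illustrative assumptions; the course may present other smoothing methods.

```python
from collections import Counter

class BigramModel:
    """Bigram model with add-one (Laplace) smoothing over the training vocabulary."""

    def __init__(self, sentences):
        self.context_counts = Counter()  # how often each word occurs as a left context
        self.bigram_counts = Counter()   # how often each (previous, word) pair occurs
        self.vocab = set()
        for sentence in sentences:
            # Pad with sentence-boundary markers.
            tokens = ["<s>"] + sentence + ["</s>"]
            self.vocab.update(tokens)
            for prev, word in zip(tokens, tokens[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, word)] += 1

    def prob(self, prev, word):
        """P(word | prev) = (count(prev, word) + 1) / (count(prev) + |V|)."""
        numerator = self.bigram_counts[(prev, word)] + 1
        denominator = self.context_counts[prev] + len(self.vocab)
        return numerator / denominator

# Toy training corpus (made up for illustration).
corpus = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]
model = BigramModel(corpus)

print(model.prob("the", "dog"))    # bigram seen in training
print(model.prob("the", "barks"))  # unseen bigram, still greater than zero
```

Because one count is added for every vocabulary item, the smoothed probabilities for a given context still sum to 1 over the vocabulary.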

Assignments
Software
Resources

Last modified May 15th, 2012 Michael Elhadad