Language Models and n-grams

A language model encodes a priori knowledge about a language. It is used to answer questions such as:
P(X | I, want) = ?
that is, what is the probability of seeing a word X after the word sequence "I want"? n-gram models answer such questions by estimating probabilities of sequences of n words, in particular the conditional probability p(wn | w1...wn-1) of the n-th word given the n-1 words that precede it.
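
As a concrete illustration (an addition to these notes, not part of the original text): with a bigram model (n = 2), the query above reduces to conditioning on the most recent word only, and more generally an n-gram model scores a sequence by combining the chain rule with the Markov assumption:

    P(X \mid \text{I want}) \;\approx\; P(X \mid \text{want}),
    \qquad
    P(w_1 \dots w_m) \;\approx\; \prod_{i=1}^{m} P(w_i \mid w_{i-n+1} \dots w_{i-1})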

Learning Language Models

Given a corpus, one can learn an n-gram model by first counting the number of occurrences of each n-gram. We refer to the count of <w1...wn> in the corpus as C(w1...wn).

The objective of the learning method is to estimate p(w1...wn) given C(w1...wn).

The simplest method to derive the estimate p from C is to use only the counts observed in the corpus. This method is called Maximum Likelihood Estimation (MLE). The MLE is in general a poor estimate, because it assigns probability 0 to every n-gram that does not appear in the corpus. Since a corpus contains only a tiny fraction of the theoretically possible n-grams, the MLE is not an appropriate estimator.
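
Concretely, the MLE of the conditional probability is a ratio of counts: p_MLE(wn | w1...wn-1) = C(w1...wn) / C(w1...wn-1). The following Python fragment is a minimal sketch of this (added here as an illustration; the function name and the toy corpus are invented for the example):

    from collections import defaultdict

    def mle_ngram_model(corpus, n=2):
        # Count every n-gram C(w1...wn) and every prefix C(w1...wn-1),
        # then estimate p(wn | w1...wn-1) = C(w1...wn) / C(w1...wn-1).
        ngram_counts = defaultdict(int)
        prefix_counts = defaultdict(int)
        for sentence in corpus:
            tokens = sentence.split()
            for i in range(len(tokens) - n + 1):
                ngram = tuple(tokens[i:i + n])
                ngram_counts[ngram] += 1
                prefix_counts[ngram[:-1]] += 1
        def p(ngram):
            prefix = ngram[:-1]
            if prefix_counts[prefix] == 0:
                return 0.0  # prefix never seen in the corpus
            return ngram_counts[ngram] / prefix_counts[prefix]
        return p

    # Toy corpus, invented for the illustration.
    corpus = ["I want to eat", "I want to sleep"]
    p = mle_ngram_model(corpus, n=2)
    print(p(("want", "to")))     # 1.0 -- "want" is always followed by "to"
    print(p(("want", "sleep")))  # 0.0 -- unseen bigram gets zero probability

The last line shows the problem described above: any n-gram absent from the corpus, however plausible, receives probability 0.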

The following sources present better methods for smoothing - that is, the process of deriving a good estimate from frequency counts when data is sparse (one of the simplest such methods is sketched after the list):

  1. Chapter 4, "N-grams", of Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, by Jurafsky and Martin, 2006.
  2. Slides by Jan Hajic
  3. Slides by Chris Manning
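
As an example of the kind of technique those sources cover, here is a minimal sketch of add-one (Laplace) smoothing, one of the simplest smoothing methods: every bigram count is incremented by 1 before normalizing, so no bigram receives probability 0. This sketch is an addition to these notes; the function name and toy data are invented for the illustration:

    from collections import defaultdict

    def laplace_bigram_model(corpus, vocab):
        # Add-one (Laplace) smoothing for a bigram model:
        # p(w2 | w1) = (C(w1 w2) + 1) / (C(w1) + V), with V the vocabulary size.
        bigram_counts = defaultdict(int)
        prefix_counts = defaultdict(int)  # count of w1 as a bigram prefix
        for sentence in corpus:
            tokens = sentence.split()
            for w1, w2 in zip(tokens, tokens[1:]):
                bigram_counts[(w1, w2)] += 1
                prefix_counts[w1] += 1
        V = len(vocab)
        def p(w1, w2):
            return (bigram_counts[(w1, w2)] + 1) / (prefix_counts[w1] + V)
        return p

    # Same toy corpus as above.
    corpus = ["I want to eat", "I want to sleep"]
    vocab = {"I", "want", "to", "eat", "sleep"}
    p = laplace_bigram_model(corpus, vocab)
    print(p("want", "to"))     # (2 + 1) / (2 + 5), about 0.43
    print(p("want", "sleep"))  # (0 + 1) / (2 + 5), about 0.14 -- no longer zero

Add-one smoothing is only a crude baseline; the sources above present more refined methods.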


Last modified June 17th, 2007