This lecture describes the task of Language Modeling as an example of a statistical analysis of natural language.
We define the task and illustrate how Language Modeling helps achieve practical tasks (word prediction, spelling correction, ASR, machine translation). We introduce basic probabilistic tools required to produce and evaluate language models.
What is Language Modeling
A Language Model is a probabilistic model which predicts the probability that a sequence of tokens
belongs to a language. (Compare with the deterministic membership models of formal languages - what is the
complexity of determining that a sentence belongs to a regular language, a context-free language or a
How LM helps in NLP Tasks
The probabilities returned by a language model are mostly useful to compare the likelihood that different
sentences are "good sentences". This is useful in many practical tasks, for example:
- Spell checking: we observe a word which is not recognized as a known word as part of a sentence (that is, the word does not occur in a list of known words). Using the edit distance algorithm, we find the closest known words
to the unknown words. These are the candidate corrections. For example, we observe the word "wird" in the context of the sentence "I like to write this wird". The candidate corrections are ["word", "weird", "wind"].
How can we select among these candidates the most likely correction for the suspected error "wird"?
We compare the Language Model probability of the sentences:
and we hope that the right correction (word) will be selected.
- p(I like to write this word.)
- p(I like to write this weird.)
- p(I like to write this wind.)
- Automatic Speech Recognition: we receive as input a string of phonemes; a first model predicts for sub-sequences of the stream of phonemes candidate words; the language model helps in ranking the most likely
sequence of words compatible with the candidate words produced by the acoustic model.
- Machine Translation: each word from the source language is mapped to multiple candidate words in the target language; the language model in the target language can rank the most likely sequence of candidate target words.
N-grams are the simplest tool available to construct a language model.
An N-gram is a sequence of N words.
An N-gram model predicts the probability of a given N-gram within any sequence of words in the language.
If we have a good N-gram model, we can predict p(w | h) - that is, what is the probability of seeing word w
given a history of previous words h - where the history contains n-1 words.
To construct an N-gram model, we must estimate this probability.
We start from a corpus - which is a representative collection of texts large enough to observe statistically meaningful frequencies of words and word sequences. Very large corpora exist online - ranging in size from a few million words to multiple thousands of billions of words.
Given a corpus, we estimate p(w | h) as relative frequency counts: C(hw) / C(w)
There is a problem with this simple method of counting observed N-grams: there is an exponential number of
possible N-grams (W^N where W is the size of the dictionary - typically about 200,000 distinct words and N is the
size of the N-gram) while we observe only about L N-grams in a corpus of length L - and in fact, much less than L distinct N-grams, since many frequent N-grams appear many times. As a consequence, the count-based estimation method will return 0 for too many N-grams.
We will see below a few techniques used to avoid predicting 0 for unseen N-grams. These methods are called "smoothing" techniques.
Probability Axioms and the Chain Rule
Assume we have constructed an N-gram model. How do we obtain a complete language model on the basis of an N-gram model? For example, assume we have constructed a bigram model (N-gram model where N=2). How do we obtain the
values p(w1...ws) for any sequence of any length s.
We proceed in two steps:
The chain rule of probability is an exact rewriting of a joint-probability:
- We apply the chain rule of probability
- We then apply a very strong simplification assumption to allow us to compute p(w1...ws) in an easy manner.
What we obtain then is an approximation.
p(w1...ws) = p(w1) . p(w2 | w1) . p(w3 | w1 w2) . p(w4 | w1 w2 w3) ..... p(wn | w1...wn-1)
The chain rule explains how to compute the joint probability of a sequence by using the conditional probability of a word given previous words.
But we do not have access to these conditional probabilities, with complex conditions of up to n-1 words.
This is where we introduce a simplification assumption: we assume for all conditions, that
p(wk | w1...wk-1) = p(wk | wk-1)
That is, we approximate the history (the context) of the word wk by looking only at the last word of the context. This assumption is called the Markov assumption. (We used it here with a simplified context of length 1 - which corresponds to a bigram model - we could use larger fixed-sized histories in general).
Given this simplifying assumption, we obtain the following approximation:
p(w1...ws) = p(w1) * Product( p(wk | wk-1) for k=2...s)
It is often useful to introduce a dummy delimiter word which indicates start of sentence and one for end of sentence. These are often denoted <s> and </s> or simply SOS and EOS. These delimiters simplify the formal analysis.
Evaluating LMs: Perplexity
Now that we have constructed one language model, we ask: how good is it?
Especially because we introduced an approximation (using the Markov assumption) - how far are we from the "real" probability of the sentences? Is a trigram model much better than a bigram? By how much?
To answer this question - we have two strategies:
Note that a probabilistic model does not predict specific data. Instead, it assigns a predicted probability to possible data. In the case of a language model, the model predicts the probability of the next word given the observed history. It does not predict a single "best next word" - it assigns the probability to observe all possible words in a given context.
The measure used to assess such models is called perplexity. Given a test dataset w1...wN, the perplexity of the model p is p(w1...wN)^1/N.
For example, for a bigram model, the perplexity formula is:
- Extrinsic evaluation: this is an end-to-end evaluation of a task we know how to evaluate. For example, assume we know how to measure the quality of a spell-checker, and this spell-checker relies on a language model. We want to compare LM1 and LM2. We evauate the spell-checker when using LM1, then using LM2 and
compare the quality of the resulting spell-checkers. This is the best possible type of evaluation - but often
an expensive form of evaluation.
- Intrinsic evaluation: this evaluates the quality of a model without any dependence on an external task, but only by measuring a metric of the internal parameters of the model. Intrinsic evaluations are most often achieved by using a training / testing dataset. We "train" the probabilistic model on training data used to estimate the probabilities. We then apply the model on the test dataset and compare the predictions made by the trained model and the observed data. The less differences, the better the model.
Product( 1/ p(wi | wi-1) i=1..N ) ^ 1/N
Observe how the power 1/N is a normalization factor - it allows us to obtain a measure which is independent of the size of the training dataset N.
We will see later that perplexity is related to the entropy of the model p (see
Unknown Words, Zeros
The models we have assumed so far suffer from drastic problems:
These problems are called unknown words and sparse training.
Assigning a probability of 0 to an N-gram is a drastic decision - because it means that any sentence that contains this N-gram is deemed as impossible in the language model and will also receive a 0 probability.
The other problem of assigning a 0 probability to an N-gram is that it means that other N-grams are under-estimated. Because the mass of probabilities must sum to 1.0, it means the other N-grams (those that have a non-zero probability) are over-estimated.
The solution to the problem of unknown words is to predict that some of the words we will observe in new text are part of a new category of words - a category we will call "unknown". By definition, if the corpus we use is large enough, the unknown words will all be relatively low-frequency words. We must assess what is the probability of meeting an unknown word and we must estimate in which contexts these unknown words will occur (within which N-gram). See Section 4.3 of Language Modeling with Ngrams from Speech and Language Processing (SPL3) for a brief discussion.
- We assume we know all the words in the vocabulary. This is in fact very far from the truth.
- We assume that after training, we have observed most possible N-grams. In fact, for a corpus of length L,
we observe about L/10 to L/100 distinct N-words, which is a very low portion of the possible W^N distinct N-grams that are theoretically possible.
Probability Estimation, Smoothing
The solution to the problem of unseen N-grams is to re-distribute some of the probability mass from the observed frequencies to unseen N-grams. This is a general problem in probabilistic modeling called smoothing.
We will learn general techniques to solve smoothing as part of more general estimation techniques in Lecture 4.
See Section 4.4 of Language Modeling with Ngrams from Speech and Language Processing (SPL3) for a presentation of the classical smoothing techniques (Laplace, add-k).
State of the Art: Neural Language Models
In the past 5 years, there has been tremendous progress in language modeling using Neural Language Models.
We will introduce the basics of Deep Learning for NLP in Lecture 3. For a review of recent progress in Neural Language Modeling, see Lecture 8: Recurrent Neural Networks and Language Models from Richard Socher in the Deep Learning for NLP course at Stanford University, Fall 2017.
The lecture compares traditional language models using N-grams as presented here with the recent recurrent neural networks (RNNs) introduced in the recent past. RNNs capture a better representation of the history of the sentences to compute the prediction of the distribution of candidate next words.
Last modified 29 Oct 2017