NLP 21
Language Modeling
This lecture describes the task of Language Modeling as an example of a statistical analysis of natural language.
We define the task and illustrate how Language Modeling helps achieve practical tasks (word prediction, spelling correction, ASR, machine translation).
We introduce basic probabilistic tools required to produce and evaluate language models.
What is Language Modeling
A Language Model is a probabilistic model which predicts the probability that a sequence of tokens
belongs to a language. (Compare with the deterministic membership models of formal languages - what is the
complexity of determining that a sentence belongs to a regular language, a context-free language, or a
context-sensitive language?)
Chomsky Hierarchy of Formal Languages
- A formal language is defined as a (usually infinite) set of sequences (units can be words or letters)
- The key task performed on languages is the "membership test" (known as the "decision problem") - given a sentence, can we determine algorithmically that the sentence belongs to the language.
- The tool used to model this task is a "formal grammar" with a parsing algorithm (based on automata variants or Turing Machines).
- The Chomsky Hierarchy refers to families of grammars and corresponding parsing algorithms with increasing space/time complexity.
- The key levels in the hierarchy are:
- regular languages: rules of the form {B → a; B → Ac; B → ε} (left-linear form); can be parsed in O(n) steps with a finite-state automaton;
- context free languages: rules of the form {A → a non-empty string of terminals or non-terminals}; examples: properly matched parentheses, palindromes;
can be parsed in O(n³). (Equivalence, disjointness, and containment are undecidable for context-free languages.)
- context sensitive languages: rules of the form {αAβ → αγβ} where γ is a non-empty sequence of terminals or non-terminals;
examples: \(a^nb^nc^n\); \(\{a^p \mid p \text{ is prime}\}\); the decision problem is decidable; the emptiness test is undecidable.
- recursively enumerable languages: rules of the form {α → β} where α contains at least one non-terminal.
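The categorical membership test at the bottom of the hierarchy can be sketched concretely. The following is a minimal illustration (not from the lecture) of an O(n) membership test for the regular language \(\{a^mb^n : m, n \geq 0\}\), run as a hand-coded finite-state automaton:

```python
# Membership test for the regular language L = { a^m b^n : m, n >= 0 },
# implemented as a two-state finite automaton: O(n) in the input length.

def accepts_a_star_b_star(s: str) -> bool:
    state = "A"  # state "A": still reading a's
    for ch in s:
        if state == "A":
            if ch == "a":
                state = "A"       # stay: more a's allowed
            elif ch == "b":
                state = "B"       # switch: from now on only b's
            else:
                return False      # illegal symbol
        else:  # state "B": only b's allowed from here on
            if ch != "b":
                return False
    return True  # both states are accepting

print(accepts_a_star_b_star("aaabb"))  # True
print(accepts_a_star_b_star("aba"))    # False
```

The test is a yes/no decision: there is no notion of one string being "more grammatical" than another, which is exactly the contrast drawn with probabilistic language models below.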
Formal Languages vs. LMs
- The membership test of formal languages is categorical.
- Probabilistic language models provide a non-categorical perspective on the same task.
- LMs can be used without explicit representation of a grammar describing the language - instead, they can be induced from a sample of the language.
How LM helps in NLP Tasks
The probabilities returned by a language model are mostly useful to compare the likelihood that different
sentences are "good sentences." In other words, we do not really care about the absolute value of a probability,
but it is helpful to compare the probabilities of two distinct sequences.
This is useful in many practical tasks, for example:
- Spell checking: we observe a word which is not recognized as a known word as part of a sentence (that is, the word does not occur in a list of known words). Using the Edit Distance algorithm, we find the closest known words to the unknown word.
These are candidate corrections. For example, we observe the word "wird" in the context of the sentence "I like to write this wird". The candidate corrections are ["word", "weird", "wind"]. How can we select among these candidates the most likely correction for the suspected error "wird"?
We compare the Language Model probability of the sentences:
- p(I like to write this word.)
- p(I like to write this weird.)
- p(I like to write this wind.)
and we hope that the right correction (word) will be selected.
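The candidate-ranking step can be sketched as follows. This is a hedged illustration: the bigram probabilities below are invented for the example (a real system would estimate them from a corpus), and unseen bigrams are given an arbitrary floor value.

```python
# Ranking spelling-correction candidates with a toy bigram LM.
# All probabilities here are invented for illustration only.

import math

# hypothetical bigram log-probabilities log p(w2 | w1)
logp = {
    ("write", "this"): math.log(0.02),
    ("this", "word"): math.log(0.01),
    ("this", "weird"): math.log(0.004),
    ("this", "wind"): math.log(0.0001),
}

def sentence_logprob(words):
    # sum available bigram log-probs; unseen pairs get a small floor value
    return sum(logp.get(pair, math.log(1e-8))
               for pair in zip(words, words[1:]))

context = "I like to write this".split()
candidates = ["word", "weird", "wind"]
best = max(candidates, key=lambda w: sentence_logprob(context + [w]))
print(best)  # "word" wins under these invented probabilities
```

Working in log-space avoids numerical underflow when multiplying many small probabilities, which is standard practice for language models.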
- Automatic Speech Recognition: we receive as input a string of phonemes; a first model predicts for sub-sequences of the stream of phonemes candidate words; the language model helps in ranking the most likely sequence of words compatible with the candidate words produced by the acoustic model.
- Machine Translation: each word from the source language is mapped to multiple candidate words in the target language; the language model in the target language can rank the most likely sequence of candidate target words.
N-grams
We now address the question of how to "learn" a language model given a sample of observed sequences of words.
N-grams are the simplest tool available to construct a language model.
An N-gram is a sequence of N words.
An N-gram model predicts the probability of a given N-gram within any sequence of words in the language.
If we have a good N-gram model, we can predict \(p(w | h)\) - that is, what is the probability of seeing word \(w\)
given a history of previous words \(h\) - where the history contains \(n-1\) words.
To construct an N-gram model, we must estimate this probability.
We start from a corpus - which is a representative collection of texts large enough to observe statistically meaningful frequencies of words and word sequences. Very large corpora exist online - ranging in size from a few million words to several trillion words.
Given a corpus, we estimate \(p(w | h)\) as a relative frequency count: \(\frac{Count(hw)}{Count(h)}\)
There is a problem with this simple method of counting observed N-grams: there is an exponential number of
possible N-grams (\(W^N\) where \(W\) is the size of the dictionary - typically about 200,000 distinct words and \(N\) is the size of the N-gram)
while we observe only about \(L\) N-grams in a corpus of length \(L\) - and in fact far fewer than \(L\) distinct N-grams, since many frequent N-grams appear many times.
As a consequence, the count-based estimation method will return 0 for too many N-grams that have never been seen in the training corpus.
We will learn later techniques used to avoid predicting 0 for unseen N-grams. These methods are called "smoothing" techniques.
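The count-based estimate above can be sketched in a few lines. This is a minimal sketch over a toy corpus; it also makes the zero-count problem visible:

```python
# Maximum-likelihood bigram estimation over a toy corpus:
# p(w | h) = Count(h w) / Count(h)

from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p(w, h):
    # relative-frequency estimate; returns 0.0 for unseen bigrams,
    # which is exactly the sparsity problem smoothing must fix
    return bigrams[(h, w)] / unigrams[h] if unigrams[h] else 0.0

print(p("cat", "the"))  # Count(the cat)=2, Count(the)=3 -> 0.666...
print(p("dog", "the"))  # unseen bigram -> 0.0
```

Note the last line: any bigram absent from the training corpus gets probability 0, motivating the smoothing techniques discussed below.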
Probability Axioms and the Chain Rule
Assume we have constructed an N-gram model.
How do we obtain a complete language model on the basis of an N-gram model?
For example, assume we have constructed a bigram model (N-gram model where \(N=2\)).
How do we obtain the values \(p(w_1 \dots w_s)\) for any sequence of any length \(s\)?
We proceed in two steps:
- We apply the chain rule of probability
- We then apply a very strong simplification assumption to allow us to compute \(p(w_1 \dots w_s)\) in an easy manner.
What we obtain then is an approximation.
The chain rule of probability is an exact rewriting of a joint-probability:
$$p(w_1 \dots w_s) = p(w_1) \times p(w_2 | w_1) \times p(w_3 | w_1 w_2) \times p(w_4 | w_1 w_2 w_3) \dots \times p(w_s | w_1 \dots w_{s-1})$$
The chain rule explains how to compute the joint probability of a sequence by using the conditional probability of a word given previous words.
But we do not have access to these conditional probabilities, with complex conditions of up to \(s-1\) words.
This is where we introduce a simplification assumption: we assume for all conditions, that
$$p(w_k | w_1 \dots w_{k-1}) = p(w_k | w_{k-1})$$
That is, we approximate the history (the context) of the word \(w_k\) by looking only at the last word of the context.
This assumption is called the Markov assumption. (We used it here with a simplified context of length 1 - which corresponds to a bigram model - we could use larger fixed-sized histories in general).
Given this simplifying assumption, we obtain the following approximation:
$$p(w_1 \dots w_s) = p(w_1) \times \prod_{k=2}^{s}{p(w_k | w_{k-1})}$$
Each of the terms \(p(w_i | w_j)\) is a learned parameter of the model \(p\).
How many parameters do you expect such a model to have, given a vocabulary of \(W\) distinct words?
It is often useful to introduce a dummy delimiter word which indicates start of sentence and one for end of sentence. These are often denoted <s> and </s> or simply SOS and EOS. These delimiters simplify the formal analysis.
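Putting the Markov assumption and the delimiters together, the bigram sentence probability can be sketched as below. The probabilities are invented for illustration, and unseen bigrams get an arbitrary floor value:

```python
# Bigram approximation of a sentence probability, with <s>/</s> delimiters.
# All probabilities are invented for illustration only.

import math

# hypothetical bigram log-probabilities log p(w2 | w1)
logp = {
    ("<s>", "i"): math.log(0.2),
    ("i", "like"): math.log(0.1),
    ("like", "cats"): math.log(0.05),
    ("cats", "</s>"): math.log(0.3),
}

def sentence_prob(words):
    # p(w_1..w_s) ~= prod_k p(w_k | w_{k-1}), with padded delimiters
    padded = ["<s>"] + words + ["</s>"]
    total = sum(logp.get(pair, math.log(1e-8))
                for pair in zip(padded, padded[1:]))
    return math.exp(total)

print(sentence_prob("i like cats".split()))
# 0.2 * 0.1 * 0.05 * 0.3 = 3e-4
```

Thanks to the <s> delimiter, even the first real word is conditioned on a (dummy) previous word, so every factor in the product has the same bigram form.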
Evaluating LMs: Perplexity
Now that we have constructed one language model, we ask: how good is it?
Especially because we introduced an approximation (using the Markov assumption) - how far are we from the "real" probability of the sentences?
Is a trigram model much better than a bigram? By how much?
To answer this question - we apply two strategies:
- Extrinsic evaluation: this is an end-to-end evaluation of a task we know how to evaluate.
For example, assume we know how to measure the quality of a spell-checker, and this spell-checker relies on a language model. We want to compare \(LM_1\) and \(LM_2\).
We evaluate the spell-checker when using \(LM_1\), then when using \(LM_2\), and compare the quality of the resulting spell-checkers.
This is the best possible type of evaluation - but often an expensive form of evaluation.
- Intrinsic evaluation: this evaluates the quality of a model without any dependence on an external task, only by measuring a metric over the internal parameters of the model. Intrinsic evaluations are most often achieved by using a training / testing dataset. We "train" the probabilistic model on training data used to estimate the probabilities. We then apply the model on the test dataset and compare the predictions made by the trained model with the observed data. The smaller the differences, the better the model.
Note that a probabilistic model does not predict specific data. Instead, it assigns a predicted probability to possible data. In the case of a language model, the model predicts the probability of the next word given the observed history. It does not predict a single "best next word" - it assigns the probability to observe all possible words in a given context.
The measure used to assess such models is called perplexity. Given a test dataset \(w_1 \dots w_L\), the perplexity of the model p is
\(p(w_1 \dots w_L)^{-1/L}\).
For example, for a bigram model \(p\), which has been trained on a distinct corpus, the formula to compute the perplexity of a new sequence of words
\(W = w_0 \dots w_L\) is by definition:
$$Perplexity_p(W) = \left(\prod_{i=1}^{L}{1 \over p(w_i | w_{i-1})}\right)^{1 \over L} \qquad (1)$$
Observe how the power \(1/L\) is a normalization factor - it allows us to obtain a measure which is independent of the size of the tested sequence \(L\).
We will see later that perplexity is related to the entropy of the model \(p\)
(see Information theory
or SPL, Section 3.7).
The lower the perplexity, the better the model.
An intuitive way to interpret perplexity is as the "branching factor" of a model: given a history of \(n-1\) words, we need to predict the next word.
Perplexity can be understood as the number of options (distinct words) the model is effectively considering as candidates for the "next word".
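Equation (1) can be computed directly. The following sketch uses a toy bigram model with invented probabilities (a real model would be trained on a corpus), and evaluates it on a short test sequence:

```python
# Perplexity of a toy bigram model on a test sequence, following
# equation (1): PP(W) = (prod_i 1 / p(w_i | w_{i-1}))^(1/L).
# The model probabilities are invented for illustration only.

import math

p = {
    ("<s>", "a"): 0.5,
    ("a", "b"): 0.25,
    ("b", "a"): 0.5,
}

def perplexity(words):
    padded = ["<s>"] + words
    L = len(words)
    # sum logs instead of multiplying probabilities, for numerical stability
    log_sum = sum(math.log(p[pair]) for pair in zip(padded, padded[1:]))
    return math.exp(-log_sum / L)

print(perplexity(["a", "b", "a"]))
# (1 / (0.5 * 0.25 * 0.5))^(1/3) = 16^(1/3) ≈ 2.52
```

A perplexity of about 2.5 matches the branching-factor intuition: on average the model hesitates between roughly 2.5 equally likely next words.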
Perplexity Example
(From Speech and Language Processing, Jurafsky & Martin, Section 3.3):
We trained unigram, bigram, and trigram grammars on 38 million words (including start-of-sentence tokens) from the Wall Street Journal,
using a 19,979 word vocabulary.
We then computed the perplexity of each of these models on a test set of 1.5 million words with equation (1).
The table below shows the perplexity of a 1.5 million word WSJ test set according to each of these grammars.
| | Unigram | Bigram | Trigram |
|---|---|---|---|
| Perplexity | 962 | 170 | 109 |
As we see above, the more information the n-gram gives us about the word sequence, the lower the perplexity.
Note that in computing perplexities, the n-gram model P must be constructed without any knowledge of the test set or any prior knowledge of the vocabulary of
the test set. Any kind of knowledge of the test set can cause the perplexity to be artificially low.
The perplexity of two language models is only comparable if they use identical vocabularies.
Unknown Words, Zeros
The models we have assumed so far suffer from drastic problems:
- We assume we know all the words in the vocabulary. This is in fact very far from the truth.
- We assume that after training, we have observed most possible N-grams. In fact, for a corpus of length \(L\),
we observe about \(L/10\) to \(L/100\) distinct N-grams, which is a very small fraction of the \(W^N\) distinct N-grams that are theoretically possible.
These two problems are called unknown words and sparse training.
Assigning a probability of \(0\) to an N-gram is a drastic decision - because it means that any sentence that contains this N-gram is deemed impossible in the language model and will also receive a 0 probability.
The other problem with assigning a \(0\) probability to unseen N-grams is that the probability mass must sum to
1.0; if unseen N-grams receive 0, the observed N-grams (those with a non-zero probability) are over-estimated.
The solution to the problem of unknown words is to predict that some of the words we will observe in new text are part of a new category of words - a category we will call "unknown". By definition, if the corpus we use is large enough, the unknown words will all be relatively low-frequency words. We must assess what is the probability of meeting an unknown word and we must estimate in which contexts these unknown words will occur (within which N-gram). See Section 3.3 of N-gram Language Modeling from Speech and Language Processing (SPL3) for a brief discussion.
Probability Estimation, Smoothing
The solution to the problem of unseen N-grams is to re-distribute some of the probability mass from the observed frequencies to unseen N-grams.
This is a general problem in probabilistic modeling called smoothing.
We will learn general techniques to solve smoothing as part of more general estimation techniques in Lecture 4.
See Section 3.4 of N-grams Language Modeling from Speech and Language Processing (SPL3) for a presentation of the classical smoothing techniques (Laplace, add-k).
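As a preview of these techniques, here is a sketch of add-k smoothing (Laplace smoothing when \(k=1\)) for bigram estimates, \(p(w | h) = \frac{Count(hw) + k}{Count(h) + kV}\), where \(V\) is the vocabulary size. The corpus is a toy example:

```python
# Add-k (Laplace when k=1) smoothing for bigram estimates:
# p(w | h) = (Count(h w) + k) / (Count(h) + k * V)

from collections import Counter

corpus = "the cat sat on the mat".split()
vocab = set(corpus)
V = len(vocab)  # vocabulary size

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_addk(w, h, k=1.0):
    # every bigram, seen or unseen, now gets non-zero probability mass
    return (bigrams[(h, w)] + k) / (unigrams[h] + k * V)

print(p_addk("cat", "the"))  # seen bigram: (1+1)/(2+5) ≈ 0.286
print(p_addk("dog", "the"))  # unseen bigram: (0+1)/(2+5) ≈ 0.143, not 0
```

The effect is to shave probability mass off the observed bigrams and redistribute it uniformly over the unseen ones; Lecture 4 and SPL3 Section 3.4 discuss less crude redistributions.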
State of the Art: Neural Language Models
In the past 5 years, there has been tremendous progress in language modeling using Neural Language Models.
We will introduce the basics of Deep Learning for NLP in Lecture 3. For a review of recent progress in Neural Language Modeling, see Lecture 8: Recurrent Neural Networks and Language Models from Richard Socher in the Deep Learning for NLP course at Stanford University, Fall 2017.
The lecture compares traditional language models using N-grams as presented here with the recent recurrent neural networks (RNNs) introduced in the recent past. RNNs capture a better representation of the history of the sentences to compute the prediction of the distribution of candidate next words.
The strongest methods of language modeling are now implemented using an architecture called "transformers".
Transformers are trained on variants of the "next word prediction" task, most frequently the "masked word prediction" task:
given a sentence \(w_1 \dots w_n\), hide the word \(w_j\) and ask the transformer model to guess the missing word.
This task encourages the model to base the prediction of the word on the context - both left and right - of
the missing word. The following lecture Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 14 – Transformers and Self-Attention provides a good introduction to the topic (1 hour video).
Last modified 25 Oct 2020