NLP 21

Language Modeling

This lecture describes the task of Language Modeling as an example of a statistical analysis of natural language. We define the task and illustrate how Language Modeling helps achieve practical tasks (word prediction, spelling correction, ASR, machine translation). We introduce basic probabilistic tools required to produce and evaluate language models.

What is Language Modeling

A Language Model is a probabilistic model which predicts the probability that a sequence of tokens belongs to a language. (Compare with the deterministic membership models of formal languages - what is the complexity of determining that a sentence belongs to a regular language, a context-free language or a context-sensitive language?)

Chomsky Hierarchy of Formal Languages

Formal Languages vs. LMs

QUIZZ 02e01


How LMs Help in NLP Tasks

The probabilities returned by a language model are mostly useful to compare the likelihood that different sentences are "good sentences." In other words, we do not really care about the absolute value of a probability, but it is helpful to compare the probabilities of two distinct sequences. This is useful in many practical tasks, for example word prediction, spelling correction, speech recognition, and machine translation.

QUIZZ 02e02


N-grams

We now address the question of how to "learn" a language model given a sample of observed sequences of words. N-grams are the simplest tool available to construct a language model.

An N-gram is a sequence of N words.

An N-gram model predicts the probability of a given N-gram within any sequence of words in the language. If we have a good N-gram model, we can predict \(p(w | h)\) - that is, what is the probability of seeing word \(w\) given a history of previous words \(h\) - where the history contains \(n-1\) words.

To construct an N-gram model, we must estimate this probability.

We start from a corpus - which is a representative collection of texts large enough to observe statistically meaningful frequencies of words and word sequences. Very large corpora exist online - ranging in size from a few million words to several trillion words.

Given a corpus, we estimate \(p(w | h)\) as a relative frequency count: \({Count(hw) \over Count(h)}\)
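
A minimal sketch in Python of this relative-frequency estimation for a bigram model (\(N=2\)), assuming the corpus is already tokenized into lists of words (the function names are illustrative, not taken from any particular library):

```python
from collections import Counter

def train_bigram_mle(sentences):
    """Estimate p(w | h) for a bigram model by relative frequency counts.

    `sentences` is an iterable of token lists, e.g. [["the", "cat", "sat"], ...].
    """
    unigram_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))

    def p(w, h):
        # Count(h w) / Count(h); returns 0.0 for unseen histories or bigrams.
        if unigram_counts[h] == 0:
            return 0.0
        return bigram_counts[(h, w)] / unigram_counts[h]

    return p

# Example usage on a toy corpus:
p = train_bigram_mle([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(p("cat", "the"))  # 0.5
```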

There is a problem with this simple method of counting observed N-grams: there is an exponential number of possible N-grams (\(W^N\), where \(W\) is the size of the dictionary - typically about 200,000 distinct words - and \(N\) is the size of the N-gram), while we observe only about \(L\) N-grams in a corpus of length \(L\) - and in fact, far fewer than \(L\) distinct N-grams, since many frequent N-grams appear many times. As a consequence, the count-based estimation method will return 0 for too many N-grams that have never been seen in the training corpus.
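
To make the orders of magnitude concrete: with \(W = 200{,}000\) and \(N = 2\) there are \(W^2 = 4 \times 10^{10}\) possible bigrams, and with \(N = 3\) there are \(W^3 = 8 \times 10^{15}\) possible trigrams, while a corpus of \(L = 10^8\) words contains at most \(10^8\) N-gram occurrences. The overwhelming majority of possible N-grams is therefore never observed.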

We will learn later techniques used to avoid predicting 0 for unseen N-grams. These methods are called "smoothing" techniques.

QUIZZ 02e03


Probability Axioms and the Chain Rule

Assume we have constructed an N-gram model. How do we obtain a complete language model on the basis of an N-gram model? For example, assume we have constructed a bigram model (N-gram model where \(N=2\)).

How do we obtain the values \(p(w_1 \dots w_s)\) for any sequence of any length \(s\)?

We proceed in two steps:

  1. We apply the chain rule of probability
  2. We then apply a very strong simplification assumption to allow us to compute \(p(w_1 \dots w_s)\) in an easy manner. What we obtain then is an approximation.
The chain rule of probability is an exact rewriting of a joint-probability: $$p(w_1 \dots w_s) = p(w_1) \times p(w_2 | w_1) \times p(w_3 | w_1 w_2) \times p(w_4 | w_1 w_2 w_3) \dots \times p(w_s | w_1 \dots w_{s-1})$$ The chain rule explains how to compute the joint probability of a sequence by using the conditional probability of a word given previous words.
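
As a concrete illustration on a made-up three-word sentence: $$p(\text{the cat sat}) = p(\text{the}) \times p(\text{cat} \mid \text{the}) \times p(\text{sat} \mid \text{the cat})$$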

But we do not have access to these conditional probabilities, with complex conditions of up to \(s-1\) words.

This is where we introduce a simplification assumption: we assume for all conditions, that $$p(w_k | w_1 \dots w_{k-1}) = p(w_k | w_{k-1})$$ That is, we approximate the history (the context) of the word \(w_k\) by looking only at the last word of the context. This assumption is called the Markov assumption. (We used it here with a simplified context of length 1 - which corresponds to a bigram model - we could use larger fixed-sized histories in general).

Given this simplifying assumption, we obtain the following approximation: $$p(w_1 \dots w_s) = p(w_1) \times \prod_{k=2}^{s}{p(w_k | w_{k-1})}$$ Each of the terms \(p(w_i | w_j)\) is a learned parameter of the model \(p\). How many parameters do you expect there are in a model trained on a corpus with W distinct words in the vocabulary?

It is often useful to introduce dummy delimiter words that mark the start and the end of a sentence. These are often denoted <s> and </s>, or simply SOS and EOS. These delimiters simplify the formal analysis.
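
A short sketch of how the bigram approximation could be applied to a whole sentence, assuming a probability function p(w, h) such as the one trained above, and assuming the training corpus also contained the delimiters:

```python
def sentence_probability(tokens, p):
    """Bigram approximation of p(w_1 ... w_s), using <s> and </s> delimiters.

    `p(w, h)` is assumed to return the estimated probability p(w | h).
    """
    padded = ["<s>"] + tokens + ["</s>"]
    prob = 1.0
    for h, w in zip(padded, padded[1:]):
        # Markov assumption: condition only on the immediately preceding word.
        prob *= p(w, h)
    return prob

# Example usage:
# sentence_probability(["the", "cat", "sat"], p)
```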

QUIZZ 02e04


Evaluating LMs: Perplexity

Now that we have constructed one language model, we ask: how good is it? Especially because we introduced an approximation (using the Markov assumption) - how far are we from the "real" probability of the sentences? Is a trigram model much better than a bigram? By how much?

To answer this question - we apply two strategies:

  1. Extrinsic evaluation: this is an end-to-end evaluation of a task we know how to evaluate. For example, assume we know how to measure the quality of a spell-checker, and this spell-checker relies on a language model. We want to compare \(LM_1\) and \(LM_2\). We evaluate the spell-checker when using \(LM_1\), then when using \(LM_2\), and compare the quality of the resulting spell-checkers. This is the best possible type of evaluation, but it is often expensive.
  2. Intrinsic evaluation: this evaluates the quality of a model without any dependence on an external task, but only by measuring a metric over the internal parameters of the model. Intrinsic evaluations are most often performed using a training/testing split of a dataset. We "train" the probabilistic model on the training data, which is used to estimate the probabilities. We then apply the model to the test dataset and compare the predictions made by the trained model with the observed data. The smaller the differences, the better the model.
Note that a probabilistic model does not predict specific data. Instead, it assigns a predicted probability to possible data. In the case of a language model, the model predicts the probability of the next word given the observed history. It does not predict a single "best next word" - it assigns a probability to every possible word in a given context.

The measure used to assess such models is called perplexity. Given a test dataset \(w_1 \dots w_L\), the perplexity of the model p is \(p(w_1 \dots w_L)^{-1/L}\).

For example, for a bigram model \(p\), which has been trained on a distinct corpus, the formula to compute the perplexity of a new sequence of words \(W = w_0 \dots w_L\) is by definition: $$Perplexity_p(W) = \left(\prod_{i=1}^{L}{1 \over p(w_i | w_{i-1})}\right)^{1 \over L} \qquad (1)$$ Observe how the power \(1/L\) is a normalization factor - it allows us to obtain a measure which is independent of the size of the tested sequence \(L\).
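
A sketch of equation (1) in code, assuming a bigram probability function p(w, h) as in the earlier sketches; the computation is done in log space to avoid numerical underflow on long test sequences (an implementation detail, not part of the definition):

```python
import math

def perplexity(test_tokens, p):
    """Perplexity of a bigram model p on a test sequence w_0 ... w_L."""
    L = len(test_tokens) - 1  # number of predicted words w_1 ... w_L
    log_sum = 0.0
    for h, w in zip(test_tokens, test_tokens[1:]):
        prob = p(w, h)
        if prob == 0.0:
            # A single unseen bigram makes the perplexity infinite -
            # one motivation for the smoothing techniques discussed below.
            return float("inf")
        log_sum += math.log(prob)
    # (prod 1/p)^(1/L) == exp(-(1/L) * sum(log p))
    return math.exp(-log_sum / L)
```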

We will see later that perplexity is related to the entropy of the model \(p\) (see Information theory or SPL, Section 3.7).

The lower the perplexity, the better the model. An intuitive way to interpret perplexity is as the "branching factor" of the model: given a history of \(s-1\) words, we need to predict the next word, and perplexity can be understood as the number of options (distinct words) the model is still considering as candidates for the next word.
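
This interpretation can be checked on a degenerate model: if a model assigns a uniform probability \(1/W\) to every word of a vocabulary of size \(W\), then for any test sequence of length \(L\): $$\left(\prod_{i=1}^{L}{1 \over 1/W}\right)^{1 \over L} = \left(W^L\right)^{1 \over L} = W$$ that is, the model hesitates between all \(W\) words at every step.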

Perplexity Example

(From Speech and Language Processing, Jurafsky & Martin, Section 3.3):
We trained unigram, bigram, and trigram grammars on 38 million words (including start-of-sentence tokens) from the Wall Street Journal, using a 19,979 word vocabulary. We then computed the perplexity of each of these models on a test set of 1.5 million words with equation (1). The table below shows the perplexity of a 1.5 million word WSJ test set according to each of these grammars.
Model        Unigram   Bigram   Trigram
Perplexity   962       170      109
As we see above, the more information the n-gram gives us about the word sequence, the lower the perplexity. Note that in computing perplexities, the n-gram model P must be constructed without any knowledge of the test set or any prior knowledge of the vocabulary of the test set. Any kind of knowledge of the test set can cause the perplexity to be artificially low. The perplexity of two language models is only comparable if they use identical vocabularies.

Unknown Words, Zeros

The models we have assumed so far suffer from drastic problems:
  1. We assume we know all the words in the vocabulary. This is in fact very far from the truth.
  2. We assume that after training, we have observed most possible N-grams. In fact, for a corpus of length \(L\), we observe about \(L/10\) to \(L/100\) distinct N-grams, which is a very small fraction of the \(W^N\) distinct N-grams that are theoretically possible.

These two problems are called unknown words and sparse training.

Assigning a probability of \(0\) to an N-gram is a drastic decision - because it means that any sentence that contains this N-gram is deemed impossible in the language model and will also receive a 0 probability.

The other problem with assigning a \(0\) probability to unseen N-grams is that the estimates of the remaining N-grams become distorted: because the probability mass must sum to 1.0, the N-grams that do receive a non-zero probability are over-estimated.

The solution to the problem of unknown words is to predict that some of the words we will observe in new text belong to a new category of words - a category we will call "unknown". By definition, if the corpus we use is large enough, the unknown words will all be relatively low-frequency words. We must estimate the probability of encountering an unknown word, and we must estimate in which contexts (within which N-grams) these unknown words will occur. See Section 3.3 of N-gram Language Modeling from Speech and Language Processing (SPL3) for a brief discussion.
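
A common way to implement this (a sketch of one option, not the only one) is to fix the vocabulary from the training corpus, replace every out-of-vocabulary word with a special <UNK> token before counting N-grams, and apply the same replacement to new text before scoring it:

```python
from collections import Counter

def build_vocabulary(sentences, min_count=2):
    """Keep words seen at least `min_count` times; all others become <UNK>."""
    counts = Counter(w for tokens in sentences for w in tokens)
    return {w for w, c in counts.items() if c >= min_count}

def replace_unknown(tokens, vocabulary):
    """Map out-of-vocabulary words to the <UNK> token."""
    return [w if w in vocabulary else "<UNK>" for w in tokens]

# The N-gram model is then trained on the replaced sentences, so it learns
# counts (and therefore probabilities) for N-grams that contain <UNK>.
```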


Probability Estimation, Smoothing

The solution to the problem of unseen N-grams is to re-distribute some of the probability mass from the observed frequencies to unseen N-grams. This is a general problem in probabilistic modeling called smoothing.

We will study smoothing as part of more general estimation techniques in Lecture 4. See Section 3.4 of N-grams Language Modeling from Speech and Language Processing (SPL3) for a presentation of the classical smoothing techniques (Laplace, add-k).
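
As a preview, a sketch of the simplest such technique, Laplace (add-1) smoothing, applied to the bigram estimate used above: every count is incremented by 1, and the denominator grows by the vocabulary size \(W\) so that the probabilities still sum to 1.

```python
from collections import Counter

def laplace_bigram_probability(w, h, bigram_counts, unigram_counts, vocabulary_size):
    """Add-1 (Laplace) smoothed estimate of p(w | h).

    `bigram_counts` and `unigram_counts` are Counter objects of observed
    bigram and unigram counts; unseen bigrams now receive a small non-zero
    probability instead of 0.
    """
    return (bigram_counts[(h, w)] + 1) / (unigram_counts[h] + vocabulary_size)
```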

QUIZZ 02e05


State of the Art: Neural Language Models

In the past 5 years, there has been tremendous progress in language modeling using Neural Language Models. We will introduce the basics of Deep Learning for NLP in Lecture 3. For a review of recent progress in Neural Language Modeling, see Lecture 8: Recurrent Neural Networks and Language Models from Richard Socher in the Deep Learning for NLP course at Stanford University, Fall 2017. The lecture compares traditional N-gram language models, as presented here, with recurrent neural networks (RNNs). RNNs capture a richer representation of the sentence history when predicting the distribution over candidate next words.

The strongest methods of language modeling are now implemented using an architecture called "transformers". Transformers are trained on variants of the "next word prediction" task, most frequently the "masked word prediction" task: given a sentence \(w_1 \dots w_n\), hide the word \(w_j\) and ask the transformer model to guess the missing word. This task encourages the model to base its prediction on the context - both left and right - of the missing word. The following lecture, Stanford CS224N: NLP with Deep Learning | Winter 2019 | Lecture 14 – Transformers and Self-Attention, provides a good introduction to the topic (1 hour video).


Last modified 25 Oct 2020