- What is Language Modeling
- How LM helps in NLP Tasks
- N-grams
- Probability Axioms and the Chain Rule
- Evaluating LMs: Perplexity
- Unknown Words, Zeros
- Probability Estimation, Smoothing
- State of the Art: Neural Language Models

**Spell checking**: we observe a word that is not recognized as a known word within a sentence (that is, the word does not occur in a list of known words). Using the edit distance algorithm, we find the known words closest to the unknown word. These are the candidate corrections. For example, we observe the word "wird" in the context of the sentence "I like to write this wird". The candidate corrections are ["word", "weird", "wind"]. How can we select among these candidates the most likely correction for the suspected error "wird"? We compare the language model probability of the sentences:

- p(I like to write this word.)
- p(I like to write this weird.)
- p(I like to write this wind.)
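This ranking step can be sketched with a toy bigram model; the tiny corpus below is invented for illustration, and a real spell-checker would use a large corpus and a smoothed model:

```python
from collections import Counter

# Hypothetical toy corpus; in practice a large corpus would be used.
corpus = "i like to write this word . i like this word . the wind blows .".split()

# Count unigrams and bigrams observed in the corpus.
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    """MLE estimate p(word | prev) = C(prev word) / C(prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(sentence):
    """Probability of a sentence under the bigram model (no smoothing)."""
    words = sentence.split()
    p = unigrams[words[0]] / sum(unigrams.values())
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)
    return p

# Score each candidate correction in context and keep the most likely one.
candidates = ["word", "weird", "wind"]
scores = {c: sentence_prob(f"i like to write this {c}") for c in candidates}
best = max(scores, key=scores.get)
```

In this toy corpus "this word" is the only observed bigram among the three candidates, so "word" wins; with real data, smoothing would keep the other candidates from scoring exactly 0.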

**Automatic Speech Recognition**: we receive as input a stream of phonemes; a first (acoustic) model predicts candidate words for sub-sequences of the stream; the language model helps rank the most likely sequences of words compatible with the candidate words produced by the acoustic model.

**Machine Translation**: each word from the source language is mapped to multiple candidate words in the target language; the language model in the target language can rank the most likely sequences of candidate target words.

An N-gram is a sequence of N words.

An N-gram model predicts the probability of a given N-gram within any sequence of words in the language. If we have a good N-gram model, we can predict p(w | h) - that is, what is the probability of seeing word w given a history of previous words h - where the history contains n-1 words.

To construct an N-gram model, we must estimate this probability.

We start from a **corpus** - which is a representative collection of texts large enough to observe statistically meaningful frequencies of words and word sequences. Very large corpora exist online - ranging in size from a few million words to multiple thousands of billions of words.

Given a corpus, we estimate p(w | h) as a relative frequency count: C(hw) / C(h) - the number of times the history h is followed by w, divided by the number of times h occurs.
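A minimal sketch of this relative-frequency estimate (the corpus here is a made-up toy example):

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

def mle_estimate(corpus, history, word):
    """Estimate p(word | history) = C(history + word) / C(history)."""
    n = len(history)
    hw = tuple(history) + (word,)
    # Count occurrences of the history h alone.
    h_count = sum(1 for i in range(len(corpus) - n + 1)
                  if tuple(corpus[i:i + n]) == tuple(history))
    # Count occurrences of the history followed by the word (hw).
    hw_count = sum(1 for i in range(len(corpus) - n)
                   if tuple(corpus[i:i + n + 1]) == hw)
    return hw_count / h_count if h_count else 0.0
```

For example, "the cat" occurs twice in this corpus and is followed by "sat" once, so `mle_estimate(corpus, ["the", "cat"], "sat")` is 0.5.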

There is a problem with this simple method of counting observed N-grams: there is an exponential number of possible N-grams (W^N, where W is the size of the dictionary - typically about 200,000 distinct words - and N is the size of the N-gram), while we observe only about L N-grams in a corpus of length L - and in fact, far fewer than L distinct N-grams, since many frequent N-grams appear many times. As a consequence, the count-based estimation method will return 0 for too many N-grams.
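The gap between possible and observed N-grams can be made concrete even on a toy example (the sentence below is invented; with a realistic W of ~200,000 the gap is astronomically larger):

```python
tokens = "the cat sat on the mat and the cat ran to the mat".split()
vocab = set(tokens)
W = len(vocab)  # size of the dictionary for this toy corpus

# Distinct bigrams actually observed vs. the number theoretically possible.
observed = set(zip(tokens, tokens[1:]))
possible = W ** 2  # W^N with N = 2

# Only a small fraction of possible bigrams ever appears in the corpus;
# a pure count-based model assigns probability 0 to all the rest.
fraction_seen = len(observed) / possible
```

Here only 10 of the 64 possible bigrams are observed, and the fraction shrinks rapidly as W and N grow.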

We will see below a few techniques used to avoid predicting 0 for unseen N-grams. These methods are called **"smoothing"** techniques.

We proceed in two steps:

- We apply the chain rule of probability
- We then apply a very strong simplification assumption to allow us to compute p(w1...ws) in an easy manner. What we obtain then is an approximation.

p(w1...ws) = p(w1) . p(w2 | w1) . p(w3 | w1 w2) . p(w4 | w1 w2 w3) ..... p(ws | w1...ws-1)

The chain rule explains how to compute the joint probability of a sequence by using the conditional probability of a word given the previous words.

But we do not have access to these conditional probabilities, with complex conditions of up to n-1 words.

This is where we introduce a simplification assumption: we assume for all conditions, that

p(wk | w1...wk-1) = p(wk | wk-1)

That is, we approximate the probability of a word given its entire history by the probability given only the immediately preceding word. This is known as the Markov assumption; with a history of one word, it yields a bigram model.

Given this simplifying assumption, we obtain the following approximation:

p(w1...ws) = p(w1) * Product( p(wk | wk-1) for k=2...s)

It is often useful to introduce a dummy delimiter word which indicates the start of a sentence and one for the end of a sentence. These are often denoted <s> and </s>, or simply SOS and EOS. These delimiters simplify the formal analysis.
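The bigram approximation with sentence delimiters can be sketched as follows (the two training sentences are invented; no smoothing is applied, so unseen bigrams get probability 0):

```python
from collections import Counter

# Hypothetical training sentences; <s> and </s> mark sentence boundaries.
sentences = [["<s>", "i", "like", "tea", "</s>"],
             ["<s>", "i", "like", "coffee", "</s>"]]
tokens = [w for s in sentences for w in s]
unigrams = Counter(tokens)
bigrams = Counter(b for s in sentences for b in zip(s, s[1:]))

def sentence_prob(words):
    """p(w1...ws) approximated as a product of p(wk | wk-1), with delimiters."""
    padded = ["<s>"] + words + ["</s>"]
    p = 1.0
    for prev, word in zip(padded, padded[1:]):
        if unigrams[prev] == 0:
            return 0.0
        p *= bigrams[(prev, word)] / unigrams[prev]
    return p
```

Note how the delimiters let the first word be conditioned on <s> just like any other bigram, so no special case is needed for the start of the sentence.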

To evaluate the quality of a language model, we have two strategies:

**Extrinsic evaluation**: this is an end-to-end evaluation of a task we know how to evaluate. For example, assume we know how to measure the quality of a spell-checker, and this spell-checker relies on a language model. We want to compare LM1 and LM2. We evaluate the spell-checker when using LM1, then when using LM2, and compare the quality of the resulting spell-checkers. This is the best possible type of evaluation - but often an expensive one.

**Intrinsic evaluation**: this evaluates the quality of a model without any dependence on an external task, only by measuring a metric over the internal parameters of the model. Intrinsic evaluations are most often performed using a training / testing dataset split. We "train" the probabilistic model on the training data, which is used to estimate the probabilities. We then apply the model to the test dataset and compare the predictions made by the trained model with the observed data. The smaller the differences, the better the model.

The measure used to assess such models is called **perplexity**. Given a test dataset w1...wN, the perplexity of the model p is p(w1...wN)^(-1/N) - the inverse probability of the test set, normalized by its length N. Lower perplexity indicates a better model.

For example, for a bigram model, the perplexity formula is:

Perplexity = Product( 1/p(wi | wi-1) for i=1..N ) ^ (1/N)

Observe how the power 1/N is a normalization factor - it allows us to obtain a measure that is independent of the size N of the test dataset. We will see later that perplexity is related to the information-theoretic notion of cross-entropy.
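A sketch of this perplexity computation for a bigram model, working in log space to avoid numerical underflow on long test sets (the toy training sentences are invented, and the model is unsmoothed, so every test bigram is assumed to have been observed):

```python
import math
from collections import Counter

# Hypothetical training sentences with sentence delimiters.
sentences = [["<s>", "a", "b", "</s>"], ["<s>", "a", "c", "</s>"]]
unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter(b for s in sentences for b in zip(s, s[1:]))

def perplexity(test_sentence):
    """Perplexity of a bigram model: (product of 1/p(wi|wi-1)) ** (1/N)."""
    padded = ["<s>"] + test_sentence + ["</s>"]
    log_inv = 0.0
    n = 0
    for prev, word in zip(padded, padded[1:]):
        p = bigrams[(prev, word)] / unigrams[prev]
        log_inv += -math.log(p)  # accumulate log(1/p) instead of multiplying
        n += 1
    return math.exp(log_inv / n)  # the 1/N normalization, in log space
```

On the test sentence ["a", "b"], only p(b | a) = 1/2 differs from 1, so the perplexity is 2^(1/3) over the 3 bigrams.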

- We assume we know all the words in the vocabulary. This is in fact very far from the truth.
- We assume that after training, we have observed most possible N-grams. In fact, for a corpus of length L, we observe about L/10 to L/100 distinct N-grams, which is a very low portion of the W^N distinct N-grams that are theoretically possible.

These problems are called **unknown words** and **sparse training**.

Assigning a probability of 0 to an N-gram is a drastic decision - because it means that any sentence that contains this N-gram is deemed as impossible in the language model and will also receive a 0 probability.

The other problem of assigning a 0 probability to some N-grams is that the estimates of the remaining N-grams become distorted: because the mass of probabilities must sum to 1.0, the N-grams that have a non-zero probability are over-estimated.

The solution to the problem of unknown words is to predict that some of the words we will observe in new text are part of a new category of words - a category we will call "unknown". By definition, if the corpus we use is large enough, the unknown words will all be relatively low-frequency words. We must assess the probability of encountering an unknown word, and we must estimate in which contexts these unknown words will occur (within which N-grams). See Section 4.3 of Language Modeling with Ngrams from Speech and Language Processing (SLP3) for a brief discussion.
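One common way to implement this is to replace rare training words with a dedicated <UNK> token before estimating counts, so the model learns probabilities for the "unknown" category itself; the frequency threshold below is an arbitrary illustrative choice:

```python
from collections import Counter

def replace_rare_words(tokens, min_count=2):
    """Map words below a frequency threshold to a single <UNK> token.

    The threshold min_count=2 is an illustrative choice; in practice it is
    tuned, or the vocabulary is fixed in advance and everything else is <UNK>.
    """
    counts = Counter(tokens)
    return [w if counts[w] >= min_count else "<UNK>" for w in tokens]
```

At test time, any word outside the training vocabulary is mapped to <UNK> as well, so it receives the probability estimated for the unknown-word category instead of 0.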

We will learn general smoothing techniques as part of more general estimation techniques in Lecture 4. See Section 4.4 of Language Modeling with Ngrams from Speech and Language Processing (SLP3) for a presentation of the classical smoothing techniques (Laplace, add-k).
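As a preview, here is a minimal sketch of Laplace (add-one) smoothing for bigram estimates, using an invented toy corpus; add-k generalizes it by replacing the 1 with a fractional constant k:

```python
from collections import Counter

# Hypothetical toy corpus; V is the vocabulary size used in the denominator.
tokens = "the cat sat on the mat".split()
vocab = set(tokens)
V = len(vocab)
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def laplace_bigram_prob(prev, word):
    """Add-one (Laplace) estimate: (C(prev word) + 1) / (C(prev) + V).

    Every bigram count is incremented by 1, so no estimate is ever 0;
    adding V to the denominator keeps the distribution summing to 1.
    """
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)
```

Unlike the unsmoothed estimate, an unseen bigram such as ("mat", "cat") now gets a small positive probability instead of 0.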