Basic Concepts in Probability and Information Theory

This lecture summarizes basic terminology and concepts in statistics and probability and how they can be applied to model a specific linguistic task: parts of speech tagging. The following notes, from the course Statistical Methods for NLP (University of Gothenburg), introduce the terminology in a concise and useful manner:
  1. Probability Theory
  2. Random Variables
  3. Information Theory
  4. Statistical Inference
  5. Markov Models
  6. Parts of Speech Tagging: Statistical Models
Other excellent tutorials on Machine Learning are available in the repository of tutorials by Andrew Moore (Carnegie Mellon University) (also available here).

In particular, read:

  1. Probabilistic and Bayesian Analytics
  2. Cross-validation for detecting and preventing overfitting

In the following, we summarize Chapter 1 of Pattern Recognition and Machine Learning by Chris Bishop to get an operational feeling for how probability, information theory and decision theory help us design computational models for machine learning.

Probability Theory

Probability theory allows us to infer quantified relations among events in models that capture uncertainty in a rational manner. A probability function assigns a level of confidence to "events". The axiomatic formulation consists of a few simple rules. Suppose we run an experiment that can have n distinct outcomes; we want to be able to discuss our level of certainty about the results of this experiment.

Consider the set Ω of outcomes. An event is defined as a subset of outcomes from Ω -- there are 2^n events over Ω. Define a function p that meets the following conditions:

p: 2^Ω → [0..1] (that is, for every event e, 0 ≤ p(e) ≤ 1)
p(A ∪ B) = p(A) + p(B) - p(A ∩ B)
p(∅) = 0
p(Ω) = 1
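As a quick illustration, here is a minimal sketch of such a probability function over a small finite Ω; the six-sided die and the helper names below are illustrative assumptions, not part of the notes:

```python
# Minimal sketch: a probability function over a small finite outcome set.
# The six-sided die and the helper below are illustrative assumptions.
from itertools import chain, combinations

omega = {1, 2, 3, 4, 5, 6}                 # outcomes of one die roll
p_outcome = {o: 1 / 6 for o in omega}      # equi-probable outcomes

def p(event):
    """Probability of an event, i.e. a subset of omega."""
    return sum(p_outcome[o] for o in event)

A = {1, 2, 3}          # "roll at most 3"
B = {2, 4, 6}          # "roll an even number"

# the axioms above hold (up to floating-point rounding)
assert abs(p(A | B) - (p(A) + p(B) - p(A & B))) < 1e-12
assert p(set()) == 0 and abs(p(omega) - 1) < 1e-12
assert all(0 <= p(e) <= 1 for e in map(set, chain.from_iterable(
    combinations(omega, r) for r in range(len(omega) + 1))))
```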

Random Variables

We map the representation of uncertainty in terms of outcomes and events to that of random variables. The value of a random variable corresponds to the possible outcomes of the experiment. We will consider real-valued random variables. In NLP, we will often consider random variables representing the experiment of choosing a word within a vocabulary or choosing a sentence within a language. In this case, the random word is modeled as the index of the word within the vocabulary of possible words. We will often use an alternative model using an indicator vector, which is a vector of boolean values (0 or 1) of dimension N = size of the vocabulary, in which exactly one element is 1 and all the others are 0, with the 1 appearing at the position of the selected word in the vocabulary. A "random sentence" may be parameterized as a vector of random words.
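For instance, a word drawn from a vocabulary of size N can be represented either by its index or by the corresponding indicator (one-hot) vector. The following is a small sketch; the toy vocabulary is an assumption made only for illustration:

```python
# Sketch: representing a "random word" as an index or as an indicator vector.
# The toy vocabulary is an illustrative assumption.
vocabulary = ["the", "cat", "sat", "on", "mat"]
N = len(vocabulary)

def one_hot(word):
    """Indicator vector of dimension N: 1 at the word's index, 0 elsewhere."""
    vec = [0] * N
    vec[vocabulary.index(word)] = 1
    return vec

print(one_hot("cat"))          # [0, 1, 0, 0, 0]

# A "random sentence" can then be parameterized as a list of such vectors.
sentence = [one_hot(w) for w in ["the", "cat", "sat"]]
```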

The computations predicted by probabilities become interesting and sometimes surprising when we consider complex experiments, involving multiple random variables. The dependency relations among the variables are reflected in the computations using simple rules -- called "sum and product rules of probability".

Consider the classical example of boxes and balls with Bishop's figure:

Red box with 6 orange balls and 2 green balls; blue box with 1 orange ball and 3 green balls (Bishop's figure uses oranges and apples).

In this setting, we are considering a "structured experiment" which proceeds in 2 stages: first pick a box (either red or blue), then pick a ball within the selected box (either orange or green). Assume we repeat this experiment many times, and assume we get the following aggregated observations:

  1. p(red box) = 40%
  2. p(blue box) = 60%
  3. All balls within a box are equi-probable (each is equally likely to be picked)
We model this experiment with 2 random variables:
  1. B ∈ {blue, red} (the box can be either blue or red)
  2. F ∈ {orange, green} (the ball can be either orange or green)
When we consider 2 variables, we can think of the outcomes of the experiment as a matrix, where the cells measure the number of outcomes for the joint occurrence of each of the variables, as illustrated here:

Matrix view of 2 random variables

In this view, we can define several interesting quantities (assuming we ran the experiment N times, and N → ∞). Let n_ij be the number of trials in which X = x_i and Y = y_j, c_i the number of trials in which X = x_i (whatever the value of Y), and r_j the number of trials in which Y = y_j:

  1. joint probability: p(X = x_i, Y = y_j) = n_ij / N
  2. marginal probability (the "sum rule"): p(X = x_i) = c_i / N = Σ_j p(X = x_i, Y = y_j)
  3. conditional probability: p(Y = y_j | X = x_i) = n_ij / c_i

From these definitions, we derive that:
p(X = x_i, Y = y_j) = p(Y = y_j | X = x_i) p(X = x_i)
p(X = x_i, Y = y_j) = p(X = x_i | Y = y_j) p(Y = y_j)
and in general, the following relation known as Bayes rule:
p(Y | X) = p(X | Y) p(Y) / p(X)
p(X) = Σ_Y p(X | Y) p(Y)
These simple rules can provide interesting results - as applied to our box and balls example:
p(B = r) = 4/10
p(B = b) = 6/10

p(F = g | B = r) = 1/4
p(F = o | B = r) = 3/4

p(F = g | B = b) = 3/4
p(F = o | B = b) = 1/4
We can now compute the probability that a ball is orange or green based on the complex experiment:
p(F = o) = p(F = o | B = r)p(B = r) + p(F = o | B = b)p(B = b) = 9/20 = 0.45
p(F = g) = 11/20 = 0.55
Compare this result with the uniform probability we would obtain if we ignored the structure of the boxes: overall there are 7 orange balls and 5 green balls, so the uniform probability would have been p(F = o) = 7/12 ≈ 0.58. Because the boxes introduce structure over the balls, they significantly affect the likelihood of picking an orange or a green ball (p(F = o) drops from 0.58 to 0.45).

Another non-trivial computation is obtained by applying Bayes rule. Bayes rule in general is used to "reverse dependency". Consider the following observation: we run the experiment and we obtain an orange ball. What is the likelihood that the ball was picked from the red box?

p(B = r | F = o) = p(F = o | B = r)p(B = r)/p(F = o) = 2/3
One way to read this result is as follows: if we do not have any additional data, then the probability that the red box has been picked is 40%. But if we are told that an orange ball has been picked, then we can revise our estimation -- and we can now believe that the red box has been picked with 67% confidence. That is, we updated our confidence because we were given additional data - from 40% to 67%.
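The whole computation can be reproduced in a few lines of code. This is a minimal sketch that simply re-encodes the numbers given above:

```python
# Sketch: sum rule, product rule and Bayes rule on the box-and-ball example.
from fractions import Fraction as F

p_box = {"red": F(4, 10), "blue": F(6, 10)}                 # p(B)
p_ball_given_box = {                                        # p(F | B)
    "red":  {"orange": F(3, 4), "green": F(1, 4)},
    "blue": {"orange": F(1, 4), "green": F(3, 4)},
}

# Sum rule (marginalization): p(F = o) = sum_B p(F = o | B) p(B)
p_orange = sum(p_ball_given_box[b]["orange"] * p_box[b] for b in p_box)
print(p_orange)                                             # 9/20

# Bayes rule: p(B = r | F = o) = p(F = o | B = r) p(B = r) / p(F = o)
p_red_given_orange = p_ball_given_box["red"]["orange"] * p_box["red"] / p_orange
print(p_red_given_orange)                                   # 2/3
```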

We say that two random variables are independent if the following relation holds:

p(X,Y) = p(X)p(Y)
In terms of information obtained by running an experiment, independence means that knowing the value of Y does not reduce the uncertainty about X.

We can extend this important relation to conditional independence:

X and Y are conditionally independent given Z iff:
p(X,Y|Z) = p(X|Z)p(Y|Z)

Continuous case: Probability densities

So far we have worked with events defined over a discrete set of outcomes Ω. Probability functions can be extended to continuous variables (real numbers). In this case, we move from discrete sums (Σ) to integrals, and the probability function is defined in terms of a probability density function (pdf) p such that:
p(x ∈ [a..b]) = ∫_a^b p(x)dx
The normalization condition on a pdf is:
∀ x, p(x) ≥ 0
∫_{-∞}^{+∞} p(x)dx = 1
We extend the definition of pdfs to multiple variables and the marginal and conditional probabilities as follows:
p(x) = ∫ p(x,y)dy
p(x,y) = p(y|x)p(x)
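As a numerical sanity check, we can verify the normalization condition and compute p(x ∈ [a..b]) by numerical integration. This is a sketch; the exponential density used below is our own choice of example, not one discussed in the notes:

```python
# Sketch: checking that a density integrates to 1 and computing p(x in [a, b]).
# The exponential density is an illustrative choice.
import numpy as np

lam = 2.0
def pdf(x):
    # p(x) = lam * exp(-lam * x) for x >= 0 (and 0 for x < 0)
    return lam * np.exp(-lam * x)

dx = 1e-4
xs = np.arange(0.0, 50.0, dx)               # grid covering (effectively) the support
print((pdf(xs) * dx).sum())                 # ~ 1.0: the normalization condition

a, b = 0.5, 1.5
xs_ab = np.arange(a, b, dx)
print((pdf(xs_ab) * dx).sum())              # ~ exp(-1) - exp(-3) ≈ 0.318
```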

Expectation and covariance

Given a probability function, we can use it to compute the weighted average of a function. In this case, we use the terminology of a probability distribution p(x) which determines how likely each value of x is over a range of possible values. We can then define the expectation of f relative to p as:
E_x(f) = Σ_x p(x)f(x)  for the discrete case
E_x(f) = ∫ p(x)f(x)dx  for the continuous case
An important operation over a probability distribution is that of sampling: sampling is the operation of randomly choosing a value x out of the range of values which x can take, so that if we keep sampling many times (N times), we observe each value x about N·p(x) times.

If we sample x many times and obtain values (x1...xN), then the following approximation holds:

E(f) ≅ (1/N) Σ_{n=1..N} f(x_n)
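A small simulation illustrates this approximation. This is a sketch; the particular distribution and the function f are illustrative assumptions:

```python
# Sketch: approximating E(f) by averaging f over samples drawn from p.
import numpy as np

rng = np.random.default_rng(0)
values = np.array([1, 2, 3])
probs  = np.array([0.2, 0.5, 0.3])          # a small discrete distribution p(x)

f = lambda x: x ** 2

exact = np.sum(probs * f(values))           # E(f) = sum_x p(x) f(x) = 4.9
samples = rng.choice(values, size=100_000, p=probs)
approx = f(samples).mean()                  # (1/N) sum_n f(x_n)

print(exact, approx)                        # the two values should be close
```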
Expectation can also be extended to multiple variables, with the expected generalization for conditional expectation:
E_x[f(x,y)] = Σ_x p(x) f(x,y)  (this is a function of y)
E_x[f | y] = Σ_x p(x | y) f(x)
The variance of f is defined as:
var(f) = E[(f(x) - E[f(x)])^2] = E[f(x)^2] - E[f(x)]^2
Similarly, the variance can be extended to multiple variables, and we obtain the definition of covariance:
cov[x,y] = E_{x,y}[{x - E[x]}{y - E[y]}] = E_{x,y}[xy] - E[x]E[y]
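The two equivalent forms of the covariance can be checked empirically. This is a minimal sketch on synthetic data; the data-generation choices are ours:

```python
# Sketch: the two forms of cov[x, y] agree (up to sampling noise).
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)
y = 2.0 * x + rng.normal(size=100_000)      # y depends on x, so cov > 0

cov_def = np.mean((x - x.mean()) * (y - y.mean()))   # E[{x-E(x)}{y-E(y)}]
cov_alt = np.mean(x * y) - x.mean() * y.mean()       # E[xy] - E[x]E[y]
print(cov_def, cov_alt)                              # both ~ 2.0
```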
Expectation and variance are useful summaries of a complex probability distribution -- they tell us in 2 numbers something useful about a full distribution -- what is the mean value and how much variability there is around the mean. Often, we want to model the following situation: we know that a quantity has been measured with some level of noise; in such a case, we are looking for a way to describe what we know about the quantity. One way is to state that the actual quantity has a certain expected value (derived from the measurement) with some level of variance (corresponding to the accuracy of the measurement method). In this case, the 2 quantities that characterize the possible distribution of the quantity are its expectation and its variance. If we know more about the measurement process, we can suggest a finer distribution model. Otherwise, we use a standard distribution model (typically, we use the Gaussian distribution in such cases as described below).

What can be done with a distribution

Consider a distribution as an "object" in a programming language. What interface should the distribution object implement? In other words, what are the useful operations one can perform on a distribution?

The following are the key operations:

  1. Evaluate the probability (or density) p(x) of a given value x.
  2. Sample a value x according to the distribution.
  3. Compute the expectation (and variance) of a function under the distribution.
  4. Estimate the parameters of the distribution from observed data.

These operations are the key ingredients used by machine learning methods. They can be combined in many ways to solve problems of regression (guessing a continuous function given samples of the form (input, target)) and classification (guessing a discrete function mapping input values to discrete classes given a set of examples).
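Viewed as an object, a distribution could expose these operations through an interface such as the following sketch; the class and method names are our own illustrative choices, not a standard library API:

```python
# Sketch of a hypothetical Distribution "interface"; the names are illustrative.
from abc import ABC, abstractmethod

class Distribution(ABC):
    @abstractmethod
    def prob(self, x):
        """Probability (or density) of the value x."""

    @abstractmethod
    def sample(self, n):
        """Draw n values according to the distribution."""

    @abstractmethod
    def expectation(self, f):
        """Expected value of f under the distribution."""

    @abstractmethod
    def fit(self, data):
        """Estimate the distribution's parameters from observed data."""
```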

Likelihood and Maximum Likelihood Estimation

The basic method of estimation is based on Bayes theorem:
p(w|D) = p(D|w)p(w)/p(D)
Consider a distribution p, characterized by parameters w. We are interested in estimating the parameters w given an observed dataset D.

In the Bayes equation, we use the following terminology:

  1. p(w|D) is the posterior: the distribution over the parameters w after observing the data D.
  2. p(D|w) is the likelihood: how probable the observed data is for a given setting of the parameters w.
  3. p(w) is the prior: our belief about the parameters before observing any data.
  4. p(D) is the evidence (the normalizing constant).

Given this terminology, we restate Bayes theorem as:
posterior ∝ likelihood x prior
Note that we ignore the normalizing term p(D), which is a constant in w. If we are interested in computing p(D), we can use the following approach:
p(D) = ∫ p(D|w)p(w)dw

The frequentist approach to estimation consists in finding the best possible value for w given an observation D. In this case, the normalization constant p(D) plays no role, and the output is an optimal value for w, w*. The Bayesian approach instead tries to estimate the distribution over w, p(w|D). This requires the computation of the normalization constant p(D) which can be expensive.

In general, Bayesian approaches to estimation are most useful when the task we perform (regression or classification) is one step within a multi-step process. The Bayesian approach is preferable in this case because the classification step does not need to commit to the most likely value for w; instead, it can pass on a distribution over likely options and let later steps choose among them, using features that were not available at classification time. For example, consider the task of syntactic parsing. Parsing often uses as features the parts of speech of each word. Therefore, the classification task of parts of speech tagging is one step in the 2-step process of parsing. The frequentist approach will compute the most likely POS tag for each word before parsing starts. The Bayesian approach will instead compute a distribution over possible POS tags for each word, which can later be used by the parser.

In summary, estimation can proceed in 3 different ways, of increasing complexity:

  1. Maximum likelihood: find the parameter w* which yields the maximum likelihood. This ignores the prior term p(w) and the normalization constant p(D) in the Bayes equation. Maximum likelihood estimators have a tendency to over-fit the data.
  2. Maximum likelihood with prior: find the parameter w* which yields the maximum value for the composed term likelihood x prior, assuming a given prior distribution over w. This has the effect of avoiding overfitting and is similar to regularization. This ignores the normalization constant p(D) in the Bayes equation.
  3. Bayesian estimation: find the posterior distribution p(w|D) by computing the full right-hand side of Bayes equation. This is the most informative way to estimate, but is computationally expensive.
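A concrete comparison of the three approaches on a coin-tossing dataset is sketched below; it anticipates the Bernoulli/Beta results derived in the next section, and the data and prior values are illustrative assumptions:

```python
# Sketch: maximum likelihood, maximum likelihood with prior (MAP),
# and full Bayesian estimation for a coin with unknown bias mu.
data = [1, 1, 1, 0]                 # N = 4 tosses, m = 3 ones
N, m = len(data), sum(data)
a, b = 2, 2                         # Beta(a, b) prior over mu

mu_mle = m / N                                  # 1. maximum likelihood: 0.75
mu_map = (m + a - 1) / (N + a + b - 2)          # 2. mode of likelihood x prior: 0.666...
posterior = ("Beta", a + m, b + (N - m))        # 3. full posterior: Beta(5, 3)

print(mu_mle, mu_map, posterior)
```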

Useful Distributions

A small set of simple probability distributions plays a central role in probability theory. They are useful because they can be used as building blocks to construct more complex distributions in a modular manner. We review 3 such distributions here: the Bernoulli, multinomial and normal (or Gaussian) distributions. These distributions are all characterized by a small set of parameters, and one can think of our work in the following as designing a basis to describe many useful distributions by combining simple basic distributions, each with its own parameters.

Bernoulli binary distribution

The Bernoulli distribution models the simplest experiment one can think of: toss a coin and record the outcome as either 0 or 1. The distribution is characterized by a parameter μ, which is the probability of obtaining 1:
p(x=1|μ)=μ  where 0 ≤ μ ≤ 1
p(x=0|μ)=1-μ
We can summarize these 2 terms with a single expression, which will be useful to formulate sums and expectations over the Bernoulli distribution:
Bern(x|μ) = μ^x (1-μ)^(1-x)
We can then verify that:
E[x] = μ
var[x] = μ(1-μ)
Consider a dataset of N observations drawn independently from the Bernoulli distribution, D = (x_1...x_N). We can then consider the likelihood of this dataset, viewed as a function of μ for the given D:
p(D|μ) = Π_{n=1..N} p(x_n|μ) = Π_{n=1..N} μ^(x_n) (1-μ)^(1-x_n)

and the log likelihood is:

ln p(D|μ) = Σ_n ln p(x_n|μ) = Σ_n {x_n ln μ + (1-x_n) ln(1-μ)}

By computing the derivative of the log likelihood with respect to μ and setting it to zero (d/dμ ln p(D|μ) = Σ_n x_n / μ - (N - Σ_n x_n) / (1-μ) = 0), we derive that the likelihood reaches its maximum at:

μ_ML = (1/N) Σ_n x_n
As is typical of maximum likelihood, this estimate severely overfits when the dataset is small (for example, after three tosses that all come up 1, μ_ML = 1).

A close variant of the Bernoulli distribution is the binomial distribution: it describes the distribution of the number m of observations of x=1 in N independent draws of a Bernoulli experiment. Since the draws are independent of each other, we derive that:

Bin(m|N,μ) ∝ μ^m (1-μ)^(N-m)
To normalize this (that is, to make sure that the sum of Bin(m) over all possible values of m from 0 to N is equal to 1), we must count all the distinct ways to obtain m draws of 1 out of N draws. This number is (N choose m), noted (N m) = N! / ((N-m)! m!). We derive the following properties for the binomial distribution:
Bin(m|N,μ) = (N m) μ^m (1-μ)^(N-m)

E[m] = Nμ
var[m] = Nμ(1-μ)
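We can confirm these two formulas by simulation. This is a small sketch; the parameter values are arbitrary:

```python
# Sketch: empirical check of E[m] = N*mu and var[m] = N*mu*(1 - mu).
import numpy as np

rng = np.random.default_rng(2)
N, mu = 20, 0.3
m = rng.binomial(N, mu, size=200_000)     # many repetitions of N Bernoulli draws

print(m.mean(), N * mu)                   # both ~ 6.0
print(m.var(), N * mu * (1 - mu))         # both ~ 4.2
```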
If we want to compute a Bayesian estimate of Bernoulli experiments, we must select a prior for the distribution of the parameter μ. We choose this prior in a way that is intuitively appealing (we can say something understandable about the expected value of μ and its variance) and analytically convenient (we can easily integrate the prior). To this end, note that the likelihood function of the Bernoulli distribution has a form proportional to μ^x (1-μ)^(1-x). If we choose the prior to also be a power of μ and (1-μ), then the posterior distribution will have the same form as the prior. This is convenient because it means the posterior distribution will belong to the same family of functions as the prior, and the Bayesian update procedure will consist of simply updating the parameters of the prior distribution. This convenient property is called conjugacy: we identify the conjugate prior of the binomial distribution to be:
Beta(μ|a,b) = Γ(a+b) / (Γ(a)Γ(b)) μ^(a-1) (1-μ)^(b-1)

where Γ(x) = ∫_0^∞ u^(x-1) e^(-u) du
One can verify that Γ(x+1) = x Γ(x); that is, Γ(x) generalizes the factorial function to non-integer numbers. The shape of the Beta distribution for various values of the hyper-parameters a and b is shown below:
The mean and variance of the Beta distribution are:
E[μ] = a / (a+b)
var[μ] = ab / ((a+b)^2 (a+b+1))
If we choose a Beta conjugate prior for μ, we derive from Bayes theorem:
p(μ|m,l,a,b) ∝ μ^(m+a-1) (1-μ)^(l+b-1)

where m is the number of 1s and l = N-m is the number of 0s observed in the dataset D,
and a and b are hyper-parameters that determine our prior knowledge about p(μ).
If we observe this distribution, we realize it has the form of a Beta distribution as well (which is exactly what we wanted to achieve by choosing a conjugate prior).
The posterior of the binomial given a dataset with (m draws of 1, l=N-m draws of 0)
and a prior Beta(a,b) is: Beta(a+m, b+l)
We can interpret this using a notion of "pseudo-counts": we stated our prior knowledge about μ by saying it follows a Beta distribution with hyper-parameters a and b -- the parameters a and b correspond intuitively to the counts of observations we "think we have seen in the past". New observations correct this a priori belief by merging the observed actual counts m and l into the pseudo-counts a and b. In contrast to the MLE estimator, the update of the counts is "smoothed" -- we do not jump to the conclusion that the distribution follows m and l exactly (the MLE estimate corresponds to the prior Beta(m,l)), but smoothly merge (m, l) into (a, b), obtaining (a+m, b+l).

Another way of saying this is to imagine a sequential Bayesian update process: assume that instead of updating our posterior belief for a whole dataset of N observations at once, we update our belief sequentially, for each observation point, one by one. In this way, our prior can start from a uniform distribution (which happens to be Beta(1,1)), and for each datapoint, we update the distribution by adding one to either the a hyper-parameter (if the datapoint is 1) or to the b hyper-parameter (if the datapoint is 0).

This process of sequential update is illustrated graphically below:
The graph on the left shows the prior distribution over the μ parameter: it indicates a "rough" preference for a fair Bernoulli experiment (E(μ)=0.5) with high variance (low certainty), encoded as a Beta(2,2) prior distribution. The likelihood function, shown in the center diagram, corresponds to a single observation with N=m=1 (in which case the MLE estimate is μ = 1). The posterior updates the prior distribution from Beta(2,2) to Beta(3,2), which shifts the mass smoothly from 0.5 towards 1 and has lower variance than the original Beta(2,2) -- but still does not "jump" to the extreme overfitted conclusion that μ=1.
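The batch and sequential updates can both be written in a couple of lines. This sketch reproduces the Beta(2,2) → Beta(3,2) update described above:

```python
# Sketch: conjugate Beta updating, in batch and sequentially (one point at a time).
def batch_update(a, b, data):
    m = sum(data)                    # number of 1s
    l = len(data) - m                # number of 0s
    return a + m, b + l              # posterior is Beta(a + m, b + l)

def sequential_update(a, b, data):
    for x in data:
        a, b = (a + 1, b) if x == 1 else (a, b + 1)
    return a, b

prior = (2, 2)                       # rough preference for a fair coin
data = [1]                           # a single observation x = 1 (N = m = 1)
print(batch_update(*prior, data))        # (3, 2): Beta(3, 2), as in the figure
print(sequential_update(*prior, data))   # (3, 2): the same posterior
```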

Multinomial Distribution

Multinomial variables are the extension of the binomial from binary outcomes to n-ary outcomes. The typical example is throwing a die with K sides; the outcome can be a number between 1 and K. This can be represented in different ways, but one of the easiest ways is to encode the outcome of the experiment as an indicator vector of K dimensions, that is, a vector of K dimensions with exactly one value equal to 1 and all others equal to 0:
v = (0 0 ... 0 1 0 ... 0): v_i = 1 and v_j = 0 for all j ≠ i, with i, j ∈ (1..K)
encodes the outcome i out of the 1 to K possible outcomes.
Note that in this encoding, Σ_k x_k = 1. Note p(x_k = 1) = μ_k. Then the distribution of x is:
p(x | μ) = Π_{k=1..K} μ_k^(x_k)

where μ = (μ_1, ..., μ_K) is the parameter of the distribution and

Σ_k μ_k = 1

E(x|μ) = Σ_x p(x|μ) x = μ
If we consider the case where we repeat the trial N times, then the distribution is known as the multinomial, and the outcome is a vector with K dimensions, m = (m_1...m_K), such that Σ_i m_i = N. Then the distribution is:
Mult(m_1, m_2, ..., m_K | N, μ) = (N choose m_1 ... m_K) Π_k μ_k^(m_k)
      = N! / (m_1! ... m_K!) Π_k μ_k^(m_k)
Where (N choose m_1 ... m_K) is the number of ways one can partition N objects into K groups of size m_1 ... m_K.

Consider how we can compute the likelihood of a dataset D = (x_1 ... x_N), where each x_n = (x_n1 ... x_nK) is the outcome of a trial. The likelihood of the dataset D is given as follows:

likelihood_D(μ) = p(D|μ) = Π_{n=1..N} Π_{k=1..K} μ_k^(x_nk) = Π_{k=1..K} μ_k^(Σ_n x_nk)
Call m_k = Σ_{n=1..N} x_nk. m_k corresponds to the number of times the outcome of the experiment is k -- with Σ_{k=1..K} m_k = N. We see that the likelihood of D only depends on the (m_k) and not on the full details of the individual observations x_nk. We say in such a case that the K numbers m_k form sufficient statistics for the likelihood.
likelihood_D(μ) = p(D|μ) = Π_{k=1..K} μ_k^(m_k)

log-likelihood(μ) = Σ_{k=1..K} m_k ln(μ_k)
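In code, the sufficient statistics m_k are just the column sums of the one-hot observations. Maximizing the log-likelihood above under the constraint Σ_k μ_k = 1 gives μ_k = m_k / N (a standard result, not derived in the notes). The following sketch illustrates this; the die parameters and dataset are illustrative assumptions:

```python
# Sketch: sufficient statistics and the ML estimate for a multinomial.
import numpy as np

rng = np.random.default_rng(3)
K, N = 6, 10_000
true_mu = np.array([0.1, 0.1, 0.2, 0.2, 0.2, 0.2])

X = rng.multinomial(1, true_mu, size=N)   # N one-hot outcome vectors x_n
m = X.sum(axis=0)                         # sufficient statistics m_k = sum_n x_nk
mu_ml = m / N                             # ML estimate mu_k = m_k / N

print(m, mu_ml)                           # mu_ml should be close to true_mu
```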
For Bayesian inference, we introduce a conjugate prior on the multinomial which is a direct generalization of the Beta distribution introduced for the binomial distribution. This conjugate distribution is called the Dirichlet distribution and is defined as:
Dirichlet(μ | α) = Γ(α_0) / (Γ(α_1) ... Γ(α_K)) Π_{k=1..K} μ_k^(α_k - 1)

where α_0 = Σ_k α_k
Going through computations similar to those for the binomial case, we obtain the Bayesian inference step over a multinomial distribution as follows:
p(μ | D, α) ∝ p(D | μ) p(μ | α) ∝ Π_{k=1..K} μ_k^(α_k + m_k - 1)

and hence, exploiting the conjugacy of the Dirichlet prior:

p(μ | D, α) = Dirichlet(μ | α+m)
Each draw from the Dirichlet distribution yields the parameters of a multinomial distribution. The following tutorial on Structured Bayesian Nonparametric Models with Variational Inference by Percy Liang and Dan Klein (2007) provides an excellent intuitive illustration of the Dirichlet distribution.
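The following sketch shows one Dirichlet-multinomial update step followed by a draw of multinomial parameters from the resulting posterior; the α values and counts below are arbitrary illustrative choices:

```python
# Sketch: Dirichlet-multinomial conjugate update, then drawing a multinomial
# parameter vector from the posterior Dirichlet.
import numpy as np

rng = np.random.default_rng(4)
alpha = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0])   # prior Dirichlet(alpha)
m = np.array([12, 8, 3, 5, 30, 2])                 # observed counts m_k

alpha_post = alpha + m                             # posterior Dirichlet(alpha + m)
mu = rng.dirichlet(alpha_post)                     # one draw: a multinomial parameter
print(alpha_post, mu, mu.sum())                    # mu sums to 1
```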

You can also read more on conjugate priors in Julia Hockenmaier's class notes (with Dirichlet's presentation as well).


Last modified May 13th, 2012