Computing KL Divergence
As explained in the Wikipedia article on the Kullback-Leibler divergence,
the KL divergence measures how a probability distribution P diverges from a second distribution Q.
Think of P as (x1:p1, x2:p2, ..., xn:pn) with Σpi = 1.0
and Q as (x1:q1, ..., xn:qn) with Σqi = 1.0.
Assume for now that P and Q are defined over the same outcomes xi; we discuss below what happens when they are not.
Then the definition of KL is:
KL(P,Q) = Σi = 1..n [ pi * log( pi / qi) ]
When we try to compute this formula, we must address two questions:
- What to do if qi = 0 for some i or pi = 0 for some i?
- How do we define the formula when P and Q are defined over different samples?
The direct answer is that we adopt the convention: 0 * log(0) = 0.
This handles the case pi = 0, including when both pi and qi are 0: the whole term pi * log(pi / qi) is multiplied by pi = 0, so it contributes nothing.
It does not, however, handle the case where pi != 0 and qi = 0. The general definition is that in such a case, the divergence is infinite.
This means that if one distribution predicts that an event is possible (p(e)>0) and the other predicts it is
absolutely impossible (q(e)=0), then the two distributions are absolutely different.
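Under these conventions (0 * log(0) = 0, and infinite divergence when pi > 0 but qi = 0), the computation can be sketched as follows. The dict-based representation and the function name are choices made for this illustration, not part of the definition above:

```python
import math

def kl_divergence(p, q):
    """KL(P, Q) = sum over i of pi * log(pi / qi).

    p and q map outcomes to probabilities. Uses the convention
    0 * log(0) = 0, and returns infinity when some pi > 0 has qi = 0.
    """
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue  # 0 * log(0 / qx) = 0 by convention, even if qx = 0
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf  # p predicts x is possible, q says impossible
        total += px * math.log(px / qx)
    return total
```

For example, kl_divergence({'a': 1.0}, {'b': 1.0}) returns math.inf, since P assigns positive probability to an event that Q rules out entirely.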
However, in our context, the answer is slightly different. P and Q are derived from observations and sample counting -- that is, P and Q are
probability distributions derived from frequency distributions. In this case, the derived probability distribution should never predict
that an event is completely impossible: when we derive the probability distribution, we must take into account the possibility of unseen events.
This process of deriving a probability distribution from an observed frequency distribution, while reserving mass for unseen events, is called smoothing.
Since there are infinitely many unseen events, it is difficult in general to compare distributions that reserve non-zero probability
for unseen events. But when we want to compare two specific distributions, one quick and easy way is to apply absolute discounting as the
smoothing method.
Consider the following example:
P: a:1/2, b:1/4, c:1/4
Q: a:7/12, b:2/12, d:3/12
Absolute discounting works as follows. Define:
- a small constant eps (for example: eps=0.0001).
- SP = {a, b, c} the samples observed in P.
- CP = |SP| = 3, the number of samples observed in P.
- Similarly, SQ = {a, b, d}
- CQ = 3
- SU = SP U SQ = {a, b, c, d} - all the observed samples.
- CU = |SU| = 4
Given these definitions, we can define the following smoothed version of P and Q:
- p'i = pi - pc if i in SP
- p'i = eps for i in SU - SP
and similarly for Q', where pc and qc are chosen so that Σp'i = 1.0 and Σq'i = 1.0.
In our example, we obtain:
P': a:1/2-pc, b:1/4-pc, c:1/4-pc, d: eps
pc = eps/3
Q': a:7/12-qc, b:2/12-qc, c: eps, d:3/12-qc
qc = eps/3
In general, pc = eps*|SU-SP|/|SP| and qc = eps*|SU-SQ|/|SQ|.
We can then safely compute KL(P', Q') and KL(Q', P') without obtaining infinite divergence.
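The whole procedure can be sketched in code on the example above. The function names and the dict representation are illustrative assumptions, not part of the text:

```python
import math

def smooth(p, support_union, eps=0.0001):
    """Absolute discounting: give eps to each unseen sample, and subtract
    the total added mass evenly from the observed samples so the result
    still sums to 1.0."""
    seen = set(p)
    unseen = support_union - seen
    discount = eps * len(unseen) / len(seen)  # pc = eps * |SU - SP| / |SP|
    smoothed = {x: px - discount for x, px in p.items()}
    smoothed.update({x: eps for x in unseen})
    return smoothed

def kl(p, q):
    """KL(P, Q) over a shared support, with the 0 * log(0) = 0 convention."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

# The example from the text:
P = {'a': 1/2, 'b': 1/4, 'c': 1/4}
Q = {'a': 7/12, 'b': 2/12, 'd': 3/12}
SU = set(P) | set(Q)   # all observed samples: {a, b, c, d}
P_s = smooth(P, SU)    # pc = eps/3
Q_s = smooth(Q, SU)    # qc = eps/3
```

Both smoothed distributions are defined over the same support SU with strictly positive probabilities, so both KL(P', Q') and KL(Q', P') are finite.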