Computing KL Divergence

As explained in the Wikipedia article on the Kullback-Leibler divergence, KL measures the divergence between two probability distributions P and Q.

Think of P as (x_1:p_1, x_2:p_2, ..., x_n:p_n) with sum(p_i) = 1.0, and Q as (y_1:q_1, ..., y_n:q_n) with sum(q_j) = 1.0. Assume for now that P and Q are defined over the same outcomes x_i; we discuss below what to do when they differ. The definition of KL is then:

KL(P,Q) = sum(i = 1..n) [ p_i * log( p_i / q_i) ]
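As a concrete illustration, here is a minimal sketch of this formula in Python, assuming P and Q are given as dictionaries mapping each outcome to its probability, defined over the same outcomes with no zero entries (zero entries are discussed next); the name kl_divergence is ours:

    import math

    def kl_divergence(p, q):
        # p, q: dicts over the same outcomes, all probabilities strictly positive.
        return sum(p_x * math.log(p_x / q[x]) for x, p_x in p.items())

    # For example:
    # kl_divergence({'a': 0.5, 'b': 0.5}, {'a': 0.75, 'b': 0.25})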
When we try to compute this formula, we must address two questions:
  1. What do we do if q_i = 0 for some i, or p_i = 0 for some i?
  2. How do we define the formula when P and Q are defined over different sets of samples?
The direct answer is that we adopt the convention: 0 * log(0) = 0.

This handles the case p_i = 0, including when both p_i and q_i are 0: the whole term p_i * log(p_i / q_i) is taken to be 0. It does not handle the case where p_i != 0 and q_i = 0. The general definition is that in such a case the divergence is infinite. This means that if one distribution predicts that an event is possible (p(e) > 0) and the other predicts it is absolutely impossible (q(e) = 0), then the two distributions are absolutely different.
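To make these conventions concrete, here is a sketch of a single term p_i * log(p_i / q_i) computed under these rules (kl_term is our own helper name, not a standard function):

    import math

    def kl_term(p_i, q_i):
        # Convention: 0 * log(0) = 0, so a term with p_i = 0 contributes nothing,
        # even when q_i is also 0.
        if p_i == 0.0:
            return 0.0
        # p_i != 0 and q_i = 0: the divergence is infinite.
        if q_i == 0.0:
            return math.inf
        return p_i * math.log(p_i / q_i)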

However, in our context, the answer is slightly different. P and Q are derived from observations and sample counting -- that is, P and Q are probability distributions derived from frequency distributions. In this case, the solution is that the derived probability distribution should never predict that an event is completely impossible: when we derive the probability distribution from the observed frequencies, we must take into account the possibility of unseen events. The derivation of a probability distribution from an observed frequency distribution is called smoothing.

Since there are infinitely many unseen events, it is difficult to compare frequency distributions that assign non-zero probability to all unseen events. But when we only want to compare two distributions, one quick and easy way is to apply absolute discounting as a smoothing method, restricted to the samples observed in either distribution.

Consider the following example:

P: a:1/2, b:1/4, c:1/4
Q: a:7/12, b:2/12, d:3/12
Absolute discounting works as follows (see the sketch after this list). Define:
  1. a small constant eps (for example: eps=0.0001).
  2. SP = {a, b, c} the samples observed in P.
  3. CP = |SP| = 3, the number of samples observed in P.
  4. Similarly, SQ = {a, b, d}
  5. CQ = 3
  6. SU = SP U SQ = {a, b, c, d} - all the observed samples.
  7. CU = |SU| = 4
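These quantities are easy to compute from the observed distributions. A short sketch (variable names follow the definitions above):

    P = {'a': 1/2, 'b': 1/4, 'c': 1/4}
    Q = {'a': 7/12, 'b': 2/12, 'd': 3/12}

    eps = 0.0001
    SP, SQ = set(P), set(Q)      # {a, b, c} and {a, b, d}
    CP, CQ = len(SP), len(SQ)    # 3 and 3
    SU = SP | SQ                 # {a, b, c, d}
    CU = len(SU)                 # 4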
Given these definitions, we can define the following smoothed version of P and Q:
  1. P'(i) = P(i) - pc if i in SP
  2. P'(i) = eps otherwise for i in SU - SP
and similarly for Q', where pc and qc are computed so that sum(P'(i)) = 1.0 and sum(Q'(j)) = 1.0.
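A minimal sketch of this smoothing step, assuming each distribution is given as a dictionary over its observed samples; the names smooth, dist and support are ours:

    def smooth(dist, support, eps=0.0001):
        # dist: dict mapping each observed sample to its probability (sums to 1.0).
        # support: SU, the union of the samples observed in either distribution.
        unseen = [x for x in support if x not in dist]
        # The unseen samples receive eps each; the same total mass is discounted
        # from the seen samples so that the smoothed distribution sums to 1.0.
        discount = eps * len(unseen) / len(dist)
        smoothed = {x: p - discount for x, p in dist.items()}
        smoothed.update({x: eps for x in unseen})
        return smoothed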

In our example, we obtain:

P': a:1/2-pc, b:1/4-pc, c:1/4-pc, d: eps
    pc = eps/3
Q': a:7/12-qc, b:2/12-qc, c: eps, d:3/12-qc
    qc = eps/3
In general, pc = eps*|SU-SP|/|SP| and qc = eps*|SU-SQ|/|SQ|.
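As a check on the example: SU - SP = {d}, so pc = eps * 1/3 = eps/3, and SU - SQ = {c}, so qc = eps * 1/3 = eps/3, matching the values above.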

We can then safely compute KL(P', Q') and KL(Q', P') without obtaining infinite divergence.
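Putting the pieces together on the example above, using the kl_term and smooth helpers sketched earlier (a usage sketch, not a prescribed implementation):

    P = {'a': 1/2, 'b': 1/4, 'c': 1/4}
    Q = {'a': 7/12, 'b': 2/12, 'd': 3/12}

    SU = set(P) | set(Q)
    P_smoothed = smooth(P, SU)   # P'
    Q_smoothed = smooth(Q, SU)   # Q'

    def kl(p, q):
        # Sum the per-term contributions over the shared support SU.
        return sum(kl_term(p[x], q[x]) for x in p)

    print(kl(P_smoothed, Q_smoothed))   # KL(P', Q')
    print(kl(Q_smoothed, P_smoothed))   # KL(Q', P')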