Computing KL Divergence
As explained in the Wikipedia article on the Kullback-Leibler divergence,
the KL divergence measures how a probability distribution P diverges from a second distribution Q.
Think of P as (x1:p1, x2:p2, ..., xn:pn) with Σpi = 1.0
and Q as (x1:q1, ..., xn:qn) with Σqi = 1.0.
Assume for now that P and Q are defined over the same outcomes xi; we discuss below what happens when they are not.
Then the definition of KL is:
KL(P,Q) = Σi = 1..n [ pi * log( pi / qi) ]
When we try to compute this formula, we must address two questions:
- What to do if qi = 0 for some i or pi = 0 for some i?
- How do we define the formula when P and Q are defined over different samples?
The direct answer is that we adopt the convention: 0 * log(0) = 0.
This handles the case pi = 0, including when both pi and qi are 0: the whole term pi * log(pi / qi) is multiplied by pi = 0, so it contributes nothing.
It does not, however, handle the case where pi != 0 and qi = 0. The general definition is that in such a case, the divergence is infinite.
This means that if one distribution predicts that an event is possible (p(e)>0) and the other predicts it is
absolutely impossible (q(e)=0), then the two distributions are absolutely different.
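Under these conventions (0 * log(0) = 0, and infinite divergence when pi > 0 but qi = 0), the computation can be sketched as follows. The dict-based representation and the function name are choices made for this illustration, not part of the definition above:

```python
import math

def kl_divergence(p, q):
    """KL(P, Q) = sum over i of pi * log(pi / qi).

    p and q map outcomes to probabilities. Uses the convention
    0 * log(0) = 0, and returns infinity when some pi > 0 has qi = 0.
    """
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue  # 0 * log(0 / qx) = 0 by convention, even if qx = 0
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf  # p predicts x is possible, q says impossible
        total += px * math.log(px / qx)
    return total
```

For example, kl_divergence({'a': 1.0}, {'b': 1.0}) returns math.inf, since P assigns positive probability to an event that Q rules out entirely.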
However, in our context, the answer is slightly different. P and Q are derived from observations and sample counting -- that is, P and Q are
probability distributions derived from frequency distributions. In this case, the derived probability distribution should never predict
that an event is completely impossible: when we derive the probability distribution, we must take into account the possibility of unseen events.
This process of deriving a probability distribution from an observed frequency distribution, while reserving mass for unseen events, is called smoothing.
Since there are infinitely many unseen events, it is difficult in general to compare distributions that reserve non-zero probability
for unseen events. But when we want to compare two specific distributions, one quick and easy way is to apply absolute discounting as the
smoothing method.
Consider the following example:
P: a:1/2, b:1/4, c:1/4
Q: a:7/12, b:2/12, d:3/12
Absolute discounting works as follows. Define:
- a small constant eps (for example: eps=0.0001).
- SP = {a, b, c} the samples observed in P.
- CP = |SP| = 3, the number of samples observed in P.
- Similarly, SQ = {a, b, d}
- CQ = 3
- SU = SP U SQ = {a, b, c, d} - all the observed samples.
- CU = |SU| = 4
Given these definitions, we can define the following smoothed version of P and Q:
- p'i = pi - pc if i in SP
- p'i = eps for i in SU - SP
and similarly for Q', where pc and qc are chosen so that Σp'i = 1.0 and Σq'i = 1.0.
In our example, we obtain:
P': a:1/2-pc, b:1/4-pc, c:1/4-pc, d: eps
pc = eps/3
Q': a:7/12-qc, b:2/12-qc, c: eps, d:3/12-qc
qc = eps/3
In general, pc = eps*|SU-SP|/|SP| and qc = eps*|SU-SQ|/|SQ|.
We can then safely compute KL(P', Q') and KL(Q', P') without obtaining infinite divergence.
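The whole procedure can be sketched in code on the example above. The function names and the dict representation are illustrative assumptions, not part of the text:

```python
import math

def smooth(p, support_union, eps=0.0001):
    """Absolute discounting: give eps to each unseen sample, and subtract
    the total added mass evenly from the observed samples so the result
    still sums to 1.0."""
    seen = set(p)
    unseen = support_union - seen
    discount = eps * len(unseen) / len(seen)  # pc = eps * |SU - SP| / |SP|
    smoothed = {x: px - discount for x, px in p.items()}
    smoothed.update({x: eps for x in unseen})
    return smoothed

def kl(p, q):
    """KL(P, Q) over a shared support, with the 0 * log(0) = 0 convention."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

# The example from the text:
P = {'a': 1/2, 'b': 1/4, 'c': 1/4}
Q = {'a': 7/12, 'b': 2/12, 'd': 3/12}
SU = set(P) | set(Q)   # all observed samples: {a, b, c, d}
P_s = smooth(P, SU)    # pc = eps/3
Q_s = smooth(Q, SU)    # qc = eps/3
```

Both smoothed distributions are defined over the same support SU with strictly positive probabilities, so both KL(P', Q') and KL(Q', P') are finite.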