The Kullback-Leibler divergence D_KL(P || Q) = sum over x of p(x) log(p(x)/q(x)) measures how much a probability distribution P differs from a reference distribution Q, in units of information (bits when the logarithm is base 2, nats when it is natural). It quantifies the expected number of extra bits per sample needed to encode samples from P using a code optimized for Q. KL divergence is always non-negative (Gibbs' inequality), equals zero if and only if P = Q, and is not symmetric: in general D_KL(P || Q) != D_KL(Q || P). It is a central tool for comparing distributions in information theory, statistics (likelihood ratio tests), and machine learning (variational inference, training generative models).
You have seen that mutual information measures how much two random variables share. KL divergence is the more general tool: it measures how one probability distribution differs from another, and mutual information turns out to be a special case. D_KL(P || Q) = sum over x of p(x) log(p(x)/q(x)) answers: if nature generates data from P, but I designed my encoding assuming Q, how many extra bits per symbol do I waste?
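To make the "wasted bits" reading concrete, here is a minimal sketch of the sum for discrete distributions. The function name kl_divergence, the example distributions, and the handling of zero probabilities (terms with p(x) = 0 contribute nothing; the result is infinite if q(x) = 0 where p(x) > 0) are illustrative assumptions, not part of the text above.

```python
import math

def kl_divergence(p, q, base=2.0):
    """D_KL(P || Q) for discrete distributions given as equal-length lists.

    Assumed conventions: terms with p(x) == 0 contribute zero, and the
    divergence is infinite if q(x) == 0 anywhere that p(x) > 0.
    """
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue
        if qx == 0.0:
            return math.inf
        total += px * math.log(px / qx, base)
    return total

# Hypothetical example: nature uses p, the code designer assumed q (uniform).
p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]

extra_bits = kl_divergence(p, q)    # about 0.085 bits wasted per symbol
other_way = kl_divergence(q, p)     # about 0.082 bits: a different number

print(f"D_KL(P || Q) = {extra_bits:.4f} bits")
print(f"D_KL(Q || P) = {other_way:.4f} bits")
```

Here the entropy of p is 1.5 bits, while a code built for the uniform q spends log2(3) bits on every symbol, so the difference between the two is exactly the divergence computed above.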
The asymmetry of KL divergence is not a defect; it reflects a real distinction. D_KL(P || Q) measures the cost of using Q when the truth is P. D_KL(Q || P) measures the cost of using P when the truth is Q. These are different situations. In variational inference, where p is the target and q is the approximation, minimizing D_KL(q || p) (the "reverse" or "exclusive" KL) penalizes q for placing mass where p is small, producing compact, mode-seeking approximations. Minimizing D_KL(p || q) (the "forward" or "inclusive" KL) penalizes q for missing regions where p is large, producing diffuse, mean-seeking (mass-covering) approximations. The choice of direction fundamentally shapes the behavior of the approximation.
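The effect of the direction can be seen in a small experiment. The sketch below is a toy setup, not a real variational inference procedure: the target p is an assumed two-bump mixture discretized on a grid, and q is chosen by brute force from a small hand-picked set of single-bump candidates. All names, grid limits, and candidate values are hypothetical.

```python
import math

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def gaussian_on_grid(grid, mu, sigma):
    """Discretized, renormalized Gaussian shape evaluated on the grid points."""
    return normalize([math.exp(-0.5 * ((x - mu) / sigma) ** 2) for x in grid])

def kl_divergence(p, q):
    """D_KL(P || Q) in nats for discrete distributions on a shared grid."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue
        if qx == 0.0:
            return math.inf
        total += px * math.log(px / qx)
    return total

# Grid from -10 to 10 in steps of 0.1.
grid = [i / 10 for i in range(-100, 101)]

# Assumed target p: equal mixture of two narrow bumps centered at -3 and +3.
left = gaussian_on_grid(grid, -3.0, 0.7)
right = gaussian_on_grid(grid, 3.0, 0.7)
p = [0.5 * a + 0.5 * b for a, b in zip(left, right)]

# Candidate approximations q: one bump with a given (center, width).
candidates = [(mu, sigma) for mu in (-3.0, 0.0, 3.0) for sigma in (0.7, 1.5, 3.1)]

# Inclusive direction: minimize D_KL(p || q).
best_inclusive = min(
    candidates, key=lambda c: kl_divergence(p, gaussian_on_grid(grid, *c))
)
# Exclusive direction: minimize D_KL(q || p).
best_exclusive = min(
    candidates, key=lambda c: kl_divergence(gaussian_on_grid(grid, *c), p)
)

print("argmin D_KL(p || q):", best_inclusive)
print("argmin D_KL(q || p):", best_exclusive)
```

In this setup, minimizing D_KL(p || q) selects the wide candidate centered between the two bumps, while minimizing D_KL(q || p) selects a narrow candidate sitting on a single bump. The two bumps are symmetric, so which of them the exclusive direction picks is arbitrary.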
Gibbs' inequality states that D_KL(P || Q) >= 0 for all distributions P and Q, with equality if and only if P = Q. This is perhaps the most important inequality in information theory. It implies that the entropy H(P) = -sum over x of p(x) log p(x) is the minimum average code length for distribution P: coding with the ideal code lengths -log q(x) designed for any other distribution Q costs H(P) + D_KL(P || Q) on average, that is, D_KL(P || Q) extra bits per symbol. Gibbs' inequality also immediately proves that mutual information is non-negative, since I(X;Y) = D_KL(p(x,y) || p(x)p(y)) >= 0.
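The identity I(X;Y) = D_KL(p(x,y) || p(x)p(y)) can be checked numerically. The following is a small sketch for a joint distribution stored as a nested list; the function names and the two example tables are illustrative assumptions.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in bits for discrete distributions as equal-length lists."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue
        if qx == 0.0:
            return math.inf
        total += px * math.log2(px / qx)
    return total

def mutual_information(joint):
    """I(X;Y) computed as D_KL(p(x,y) || p(x)p(y)).

    joint[i][j] is assumed to hold p(X = i, Y = j).
    """
    px = [sum(row) for row in joint]           # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]     # marginal of Y (column sums)
    flat_joint = [pxy for row in joint for pxy in row]
    flat_product = [a * b for a in px for b in py]  # same row-major order
    return kl_divergence(flat_joint, flat_product)

# Hypothetical tables: X and Y usually agree, versus X and Y independent.
dependent = [[0.4, 0.1],
             [0.1, 0.4]]
independent = [[0.25, 0.25],
               [0.25, 0.25]]

mi_dependent = mutual_information(dependent)
mi_independent = mutual_information(independent)

print(f"I(X;Y), dependent:   {mi_dependent:.4f} bits")
print(f"I(X;Y), independent: {mi_independent:.4f} bits")
```

The independent table gives zero because the joint already equals the product of its marginals, which is the equality case of Gibbs' inequality. The dependent table gives a strictly positive value.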
KL divergence appears throughout modern machine learning. Cross-entropy loss, the standard training objective for classification, equals H(P) + D_KL(P || Q), where P is the true label distribution and Q is the model's predicted distribution. Minimizing cross-entropy over the model is therefore equivalent to minimizing KL divergence, since H(P) does not depend on the model. The evidence lower bound (ELBO) in variational autoencoders includes a KL term between the approximate posterior and the prior. GANs minimize divergences between real and generated distributions; the original GAN objective, at the optimal discriminator, corresponds to the Jensen-Shannon divergence, a symmetrized relative of KL. Understanding KL divergence (its asymmetry, its non-negativity, and its operational meaning as wasted bits) is essential for reasoning about any system that compares probability distributions.
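The decomposition of cross-entropy into H(P) + D_KL(P || Q) is easy to verify on a small example. This is a minimal sketch using natural logarithms (nats), as is common for training losses; the function names and the example distributions are assumptions chosen for illustration.

```python
import math

def entropy(p):
    """H(P) = -sum p(x) log p(x), in nats; zero-probability terms are skipped."""
    return -sum(px * math.log(px) for px in p if px > 0.0)

def cross_entropy(p, q):
    """H(P, Q) = -sum p(x) log q(x), in nats."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue
        if qx == 0.0:
            return math.inf
        total -= px * math.log(qx)
    return total

def kl_divergence(p, q):
    """D_KL(P || Q) = sum p(x) log(p(x)/q(x)), in nats."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue
        if qx == 0.0:
            return math.inf
        total += px * math.log(px / qx)
    return total

# Hypothetical soft label distribution P and model prediction Q.
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(f"cross-entropy        = {lhs:.6f} nats")
print(f"entropy + divergence = {rhs:.6f} nats")

# With a one-hot label, H(P) = 0, so the loss is the KL divergence itself
# and reduces to -log q(true class).
one_hot = [0.0, 1.0, 0.0]
loss = cross_entropy(one_hot, q)
print(f"one-hot loss         = {loss:.6f} nats")
```

The one-hot case is the usual classification setting: the label entropy is zero, so the cross-entropy loss and the KL divergence coincide, not merely differ by a constant.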