A language model Q assigns probability 0.01 to a word that actually occurs with probability 0.25 in the true distribution P. How does this specific word contribute to D_KL(P || Q)?
A. 0.25 * log2(0.25 / 0.01) ≈ 1.16 bits: a large contribution because Q severely underestimates this word's probability
B. 0.01 * log2(0.01 / 0.25): a negative contribution because Q assigns too little probability
C. log2(0.25 / 0.01) ≈ 4.64 bits: unweighted by the true probability
D. 0.25 * log2(0.01 / 0.25): a negative value that reduces the divergence
Each term in D_KL(P||Q) is p(x) * log(p(x)/q(x)). For this word: 0.25 * log2(0.25/0.01) = 0.25 * log2(25) ≈ 0.25 * 4.64 ≈ 1.16 bits. The contribution is large and positive because Q assigns 25 times too little probability to a common word. KL divergence heavily penalizes cases where Q assigns low probability to events that P considers likely. This is why mode-dropping in generative models (Q missing modes of P) is so costly in the KL sense.
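The arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the helper name `kl_term_bits` is made up for illustration and is not part of any library.

```python
import math


def kl_term_bits(p_x: float, q_x: float) -> float:
    """Contribution of a single outcome x to D_KL(P || Q), in bits."""
    if p_x == 0.0:
        # Standard convention: 0 * log(0 / q) is treated as 0.
        return 0.0
    return p_x * math.log2(p_x / q_x)


# The word from the question: true probability 0.25, model probability 0.01.
contribution = kl_term_bits(0.25, 0.01)
print(round(contribution, 2))  # 1.16
```

Swapping the arguments, `kl_term_bits(0.01, 0.25)`, gives the negative quantity in option B, which is a term of D_KL(Q||P) rather than D_KL(P||Q).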
Question 2 True / False
KL divergence is a proper distance metric between probability distributions.
T. True
F. False
Answer: False
KL divergence is NOT a metric. It fails two requirements: (1) it is not symmetric, since D_KL(P||Q) != D_KL(Q||P) in general, and (2) it does not satisfy the triangle inequality. It is called a 'divergence' or 'relative entropy' rather than a 'distance' for this reason. Two related quantities are commonly used as alternatives: the symmetrized KL, D_KL(P||Q) + D_KL(Q||P), which is symmetric but still violates the triangle inequality, and the square root of the Jensen-Shannon divergence, which IS a metric.
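The asymmetry is easy to see numerically. The sketch below uses two arbitrary two-outcome distributions chosen only for illustration; `kl_bits` and `js_distance` are hypothetical helper names, and the code assumes Q is nonzero wherever P is nonzero.

```python
import math


def kl_bits(p, q):
    """D_KL(p || q) in bits for two discrete distributions given as lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base-2 logs)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture of p and q
    jsd = 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)
    return math.sqrt(jsd)


P = [0.5, 0.5]
Q = [0.9, 0.1]

forward = kl_bits(P, Q)  # D_KL(P || Q)
reverse = kl_bits(Q, P)  # D_KL(Q || P)
print(forward, reverse)                      # the two directions differ
print(js_distance(P, Q), js_distance(Q, P))  # identical in both directions
```

The example demonstrates only the symmetry requirement; it does not check the triangle inequality.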
Question 3 Multiple Choice
In variational inference, we minimize D_KL(q || p) where q is an approximate posterior and p is the true posterior. Why does this tend to produce approximations q that are more concentrated (mode-seeking) than the true posterior?
A. Minimizing D_KL(q||p) penalizes q for placing mass where p has low density, so q avoids the tails and concentrates on a single mode
B. D_KL(q||p) is always smaller than D_KL(p||q), forcing q to be narrower
C. Variational inference uses gradient descent, which naturally converges to point estimates
D. The KL divergence is symmetric, so the direction does not matter
D_KL(q||p) = sum q(x) log(q(x)/p(x)). Where q(x) > 0 but p(x) ≈ 0, the log ratio log(q(x)/p(x)) becomes very large, creating a huge penalty. So q learns to avoid placing mass anywhere p does not: it 'fits inside' p. For a multimodal p, q will typically collapse to a single mode rather than spread across all modes. The forward KL, D_KL(p||q), has the opposite behavior: it penalizes q for assigning low probability where p is high, producing moment-matching (mean-seeking, or mass-covering) approximations that cover all modes but may be too diffuse.
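A small discrete example can make the two behaviors concrete. In the sketch below, the five-bin distributions `p`, `q_mode`, and `q_cover` are invented for illustration: `p` is bimodal, `q_mode` sits on the left mode, and `q_cover` is a broad single hump spanning both modes. Nothing is optimized here; the code only compares the two candidates under each direction of the KL.

```python
import math


def kl(a, b):
    """D_KL(a || b) in nats for discrete distributions given as lists."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)


# Hypothetical bimodal target: most mass on the first and last bins.
p = [0.45, 0.04, 0.02, 0.04, 0.45]

# Two hand-picked candidate approximations (illustrative only).
q_mode = [0.90, 0.07, 0.01, 0.01, 0.01]   # concentrated on the left mode
q_cover = [0.15, 0.20, 0.30, 0.20, 0.15]  # broad, covers both modes

# D_KL(q || p): the objective minimized in variational inference.
reverse_mode, reverse_cover = kl(q_mode, p), kl(q_cover, p)

# D_KL(p || q): the opposite direction.
forward_mode, forward_cover = kl(p, q_mode), kl(p, q_cover)

print("D_KL(q||p):", reverse_mode, reverse_cover)
print("D_KL(p||q):", forward_mode, forward_cover)
```

With these particular numbers, D_KL(q||p) is lower for `q_mode`, while D_KL(p||q) is lower for `q_cover`, matching the mode-seeking versus mass-covering description.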
Question 4 Short Answer
Explain the relationship between KL divergence and mutual information. How is I(X;Y) expressed as a KL divergence, and what does this representation reveal?
Model answer: Mutual information is the KL divergence between the joint distribution and the product of marginals: I(X;Y) = D_KL(p(x,y) || p(x)p(y)). This reveals that mutual information measures how far X and Y are from being independent — it is the information cost of wrongly assuming independence when the variables are actually dependent. If X and Y are independent, the joint equals the product of marginals, the KL divergence is zero, and I(X;Y) = 0. This representation also makes it clear why mutual information is always non-negative: it inherits this from Gibbs' inequality (D_KL >= 0).
This connection unifies two fundamental concepts: KL divergence as a measure of distributional difference, and mutual information as a measure of statistical dependence. Many other information-theoretic quantities can similarly be expressed in terms of KL divergences, including conditional mutual information, information gain in decision trees, and the gap between the ELBO and the log evidence in variational inference.
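The identity I(X;Y) = D_KL(p(x,y) || p(x)p(y)) can be computed directly from a joint probability table. The following is a minimal sketch; the function name `mutual_information_bits` and the two 2x2 tables are assumptions made for illustration.

```python
import math


def mutual_information_bits(joint):
    """I(X;Y) in bits, computed as D_KL(p(x,y) || p(x)p(y)).

    `joint` is a list of rows where joint[i][j] = p(X = i, Y = j).
    """
    px = [sum(row) for row in joint]        # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]  # marginal of Y (column sums)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:  # zero-probability cells contribute nothing
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi


# Independent fair bits: the joint equals the product of marginals.
independent = [[0.25, 0.25], [0.25, 0.25]]
# Perfectly dependent bits: Y is always equal to X.
dependent = [[0.5, 0.0], [0.0, 0.5]]

mi_independent = mutual_information_bits(independent)
mi_dependent = mutual_information_bits(dependent)
print(mi_independent)  # 0.0
print(mi_dependent)    # 1.0
```

The independent table gives zero, and the perfectly dependent table gives one bit, the full entropy of a fair binary variable.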