Questions: KL Divergence

4 questions to test your understanding

Question 1 Multiple Choice

A language model Q assigns probability 0.01 to a word that actually occurs with probability 0.25 in the true distribution P. How does this specific word contribute to D_KL(P || Q)?

A. 0.25 * log2(0.25 / 0.01) ≈ 1.16 bits: a large contribution because Q severely underestimates this word's probability
B. 0.01 * log2(0.01 / 0.25): a negative contribution because Q assigns too little probability
C. log2(0.25 / 0.01) ≈ 4.64 bits: unweighted by the true probability
D. 0.25 * log2(0.01 / 0.25): a negative value that reduces the divergence
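
To check the arithmetic in the four options, the expressions can be evaluated directly. This is only a minimal sketch: the names `p`, `q`, and `candidates` are illustrative and not part of the quiz, and the code evaluates each expression without saying which one matches the definition of the divergence.

```python
import math

p = 0.25  # true probability of the word under P (from the question)
q = 0.01  # probability the model Q assigns to it (from the question)

# The four candidate expressions from options A-D, evaluated in bits.
candidates = {
    "A": p * math.log2(p / q),
    "B": q * math.log2(q / p),
    "C": math.log2(p / q),
    "D": p * math.log2(q / p),
}

for label, value in candidates.items():
    print(f"{label}: {value:+.3f} bits")
```

Comparing the sign and the weighting of each value against the definition of D_KL(P || Q) is the point of the question.
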
Question 2 True / False

KL divergence is a proper distance metric between probability distributions.

T. True
F. False
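
A distance metric must, among other things, be symmetric in its arguments, so one way to probe the statement is to compute the divergence in both directions. The sketch below does this for a pair of made-up distributions; `kl_bits` and the example values are assumptions for illustration, not part of the quiz.

```python
import math

def kl_bits(p, q):
    """D_KL(p || q) in bits for two discrete distributions given as lists.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical example distributions over two outcomes.
p = [0.5, 0.5]
q = [0.9, 0.1]

forward = kl_bits(p, q)  # D_KL(p || q)
reverse = kl_bits(q, p)  # D_KL(q || p)
print(f"D_KL(p || q) = {forward:.3f} bits")
print(f"D_KL(q || p) = {reverse:.3f} bits")
```
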
Question 3 Multiple Choice

In variational inference, we minimize D_KL(q || p) where q is an approximate posterior and p is the true posterior. Why does this tend to produce approximations q that are more concentrated (mode-seeking) than the true posterior?

A. Minimizing D_KL(q || p) penalizes q for placing mass where p has low density, so q avoids the tails and concentrates on a single mode
B. D_KL(q || p) is always smaller than D_KL(p || q), forcing q to be narrower
C. Variational inference uses gradient descent, which naturally converges to point estimates
D. The KL divergence is symmetric, so the direction does not matter
Question 4 Short Answer

Explain the relationship between KL divergence and mutual information. How is I(X;Y) expressed as a KL divergence, and what does this representation reveal?

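
As a numerical companion to this question, the sketch below computes I(X;Y) using the standard identity I(X;Y) = D_KL(p(x,y) || p(x) p(y)), the divergence between the joint distribution and the product of its marginals. The function name `mutual_information_bits` and the two example joint tables are assumptions made for illustration only.

```python
import math

def mutual_information_bits(joint):
    """I(X;Y) in bits, computed as D_KL(p(x,y) || p(x) p(y)).

    `joint` is a list of rows with joint[i][j] = p(X=i, Y=j).
    """
    px = [sum(row) for row in joint]        # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]  # marginal of Y (column sums)
    total = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:  # zero-probability cells contribute nothing
                total += pxy * math.log2(pxy / (px[i] * py[j]))
    return total

# Hypothetical joint tables over two binary variables.
independent = [[0.25, 0.25], [0.25, 0.25]]  # joint equals product of marginals
copied = [[0.5, 0.0], [0.0, 0.5]]           # Y is always equal to X

mi_independent = mutual_information_bits(independent)
mi_copied = mutual_information_bits(copied)
print(mi_independent, mi_copied)
```

In the first table the joint already factorizes, so the divergence vanishes; in the second, knowing X determines Y, and the divergence equals the one bit of entropy in X.
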