Questions — KL Divergence — Open Knowledge Graph

Question 1 Multiple Choice

A language model Q assigns probability 0.01 to a word that actually occurs with probability 0.25 in the true distribution P. How does this specific word contribute to D_KL(P || Q)?

A. 0.25 * log2(0.25 / 0.01) ≈ 1.16 bits, a large contribution because Q severely underestimates this word's probability

B. 0.01 * log2(0.01 / 0.25), a negative contribution because Q assigns too little probability

C. log2(0.25 / 0.01) ≈ 4.64 bits, unweighted by the true probability

D. 0.25 * log2(0.01 / 0.25), a negative value that reduces the divergence
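
To check the arithmetic in the four options, here is a minimal Python sketch that evaluates each expression exactly as written. The name `option_values` is illustrative and not part of the question. The sketch only computes numbers and does not indicate which expression is the KL term.

```python
import math

# Values taken from the question: true probability p, model probability q.
p, q = 0.25, 0.01

# Each entry evaluates the expression written in the corresponding option.
option_values = {
    "A": p * math.log2(p / q),
    "B": q * math.log2(q / p),
    "C": math.log2(p / q),
    "D": p * math.log2(q / p),
}

for label, value in option_values.items():
    print(f"{label}: {value:+.4f} bits")
```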

Question 2 True / False

KL divergence is a proper distance metric between probability distributions.

T. True

F. False
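
One way to probe this statement is to test the metric properties numerically. The sketch below assumes two made-up discrete distributions `p` and `q` (not from the source) and a hypothetical helper `kl_bits`. It prints the divergence in both directions along with the divergence of a distribution from itself.

```python
import math

def kl_bits(p, q):
    """D_KL(p || q) in bits for discrete distributions given as lists.

    Terms with p_i == 0 contribute 0 by convention; assumes q_i > 0
    wherever p_i > 0.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative distributions over three outcomes (made up for this sketch).
p = [0.5, 0.4, 0.1]
q = [0.1, 0.3, 0.6]

forward = kl_bits(p, q)  # D_KL(p || q)
reverse = kl_bits(q, p)  # D_KL(q || p)

print(f"D_KL(p || q) = {forward:.4f} bits")
print(f"D_KL(q || p) = {reverse:.4f} bits")
print(f"D_KL(p || p) = {kl_bits(p, p):.4f} bits")
```

A metric must be symmetric and satisfy the triangle inequality, so comparing the two printed directions is a reasonable first check.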

Question 3 Multiple Choice

In variational inference, we minimize D_KL(q || p) where q is an approximate posterior and p is the true posterior. Why does this tend to produce approximations q that are more concentrated (mode-seeking) than the true posterior?

A. Minimizing D_KL(q || p) penalizes q for placing mass where p has low density, so q avoids the tails and concentrates on a single mode

B. D_KL(q || p) is always smaller than D_KL(p || q), forcing q to be narrower

C. Variational inference uses gradient descent, which naturally converges to point estimates

D. The KL divergence is symmetric, so the direction does not matter
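
To experiment with the two directions, here is a rough sketch that fits a single Gaussian q to a bimodal target p under each objective. Everything in it is an assumption made for illustration: the target (an equal mixture of unit-variance Gaussians at -3 and +3), the grid discretization, and the brute-force search over (mu, sigma), which stands in for the gradient-based optimization that real variational inference would use. The helper names `normal_pmf`, `kl`, and `best_fit` are hypothetical.

```python
import numpy as np

# Grid over which all densities are discretized.
x = np.linspace(-10.0, 10.0, 2001)

def normal_pmf(mu, sigma):
    """Gaussian density evaluated on the grid and normalized to sum to 1."""
    w = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return w / w.sum()

def kl(a, b, eps=1e-300):
    """Discrete D_KL(a || b) in nats; eps guards against log(0)."""
    return float(np.sum(a * (np.log(a + eps) - np.log(b + eps))))

# Assumed target: equal mixture of two unit-variance Gaussians at -3 and +3.
p = 0.5 * normal_pmf(-3.0, 1.0) + 0.5 * normal_pmf(3.0, 1.0)

def best_fit(objective):
    """Brute-force search for the (mu, sigma) whose Gaussian minimizes objective."""
    candidates = [(mu, sigma)
                  for mu in np.linspace(-4.0, 4.0, 81)
                  for sigma in np.linspace(0.5, 5.0, 46)]
    return min(candidates, key=lambda ms: objective(normal_pmf(*ms)))

mu_rev, sigma_rev = best_fit(lambda q: kl(q, p))  # minimize D_KL(q || p)
mu_fwd, sigma_fwd = best_fit(lambda q: kl(p, q))  # minimize D_KL(p || q)

print(f"min D_KL(q || p): mu={mu_rev:+.1f}, sigma={sigma_rev:.1f}")
print(f"min D_KL(p || q): mu={mu_fwd:+.1f}, sigma={sigma_fwd:.1f}")
```

Comparing the fitted widths and locations for the two objectives gives a concrete picture to reason about when choosing among the options.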

Question 4 Short Answer

Explain the relationship between KL divergence and mutual information. How is I(X;Y) expressed as a KL divergence, and what does this representation reveal?
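
After drafting an answer, one way to sanity-check it is to compute I(X;Y) by a different route and compare. The sketch below uses the entropy identity I(X;Y) = H(X) + H(Y) - H(X,Y) on a small joint table. The table `joint` and the helper `entropy_bits` are made up for illustration. Whatever KL expression you wrote down should give the same number on this table.

```python
import numpy as np

# Illustrative joint distribution p(x, y) over 2 x 2 outcomes (made up for this sketch).
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability cells contribute 0."""
    probs = np.asarray(probs, dtype=float).ravel()
    nz = probs[probs > 0]
    return float(-np.sum(nz * np.log2(nz)))

p_x = joint.sum(axis=1)  # marginal distribution of X
p_y = joint.sum(axis=0)  # marginal distribution of Y

mutual_info = entropy_bits(p_x) + entropy_bits(p_y) - entropy_bits(joint)
print(f"I(X;Y) = {mutual_info:.4f} bits")
```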

