Questions: Information Theory and Statistical Inference
4 questions to test your understanding
Question 1 Multiple Choice
The Kullback-Leibler divergence D_KL(p||q) = sum_x p(x) log(p(x)/q(x)) measures information lost when using q to approximate p. Why does maximum likelihood estimation (MLE) asymptotically minimize D_KL(p||q_theta)?
A. MLE minimizes the likelihood, which is the reciprocal of KL divergence
B. MLE maximizes the log-likelihood sum_i log p_theta(x_i), which is equivalent to minimizing D_KL(empirical distribution || p_theta); by the law of large numbers, this converges to minimizing D_KL(true p || p_theta)
C. MLE is defined to minimize KL divergence by design
D. KL divergence and likelihood are not related
Answer: B
Given n samples from the true distribution p, the empirical distribution p_emp puts mass 1/n on each sample. For large n, p_emp converges to p (by the law of large numbers). The log-likelihood is sum_i log p_theta(x_i) = n * E_emp[log p_theta(x)], so maximizing it is equivalent to maximizing E_emp[log p_theta(x)]. For discrete data, D_KL(p_emp || p_theta) = E_emp[log(p_emp(x)/p_theta(x))] = E_emp[log p_emp] - E_emp[log p_theta]. The first term (the negative empirical entropy) does not depend on theta, so minimizing this KL divergence with respect to theta is equivalent to maximizing the likelihood. As n increases, E_emp[log p_theta] converges to E_p[log p_theta], so the MLE approaches the parameter that minimizes D_KL(p || p_theta). In this sense, information theory recasts MLE as KL divergence minimization.
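The equivalence can be checked numerically. The sketch below is illustrative only: it assumes a Bernoulli model, a small made-up sample, and a simple grid search over theta, and the function names are hypothetical.

```python
import math

def log_likelihood(theta, samples):
    """Sum of log p_theta(x_i) for Bernoulli samples x_i in {0, 1}."""
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in samples)

def kl_empirical_to_model(theta, samples):
    """D_KL(p_emp || p_theta) for Bernoulli data, using the convention 0 * log 0 = 0."""
    n = len(samples)
    p1 = sum(samples) / n  # empirical probability of observing a 1
    kl = 0.0
    for p_emp, p_model in ((p1, theta), (1.0 - p1, 1.0 - theta)):
        if p_emp > 0.0:
            kl += p_emp * math.log(p_emp / p_model)
    return kl

# Illustrative data (an assumption, not from the quiz): 7 ones out of 10 draws.
samples = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
grid = [i / 100 for i in range(1, 100)]  # candidate theta values in (0, 1)

theta_mle = max(grid, key=lambda t: log_likelihood(t, samples))
theta_kl = min(grid, key=lambda t: kl_empirical_to_model(t, samples))
print(theta_mle, theta_kl)
```

Both searches should select the same grid point, the sample fraction of ones, because the two objectives differ only by a factor of -n and a term that does not depend on theta.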
Question 2 True / False
The Cramer-Rao bound states that the variance of any unbiased estimator of theta is lower-bounded by 1/F(theta), where F is Fisher information. This bound is information-theoretic: it relates curvature of the likelihood landscape to estimation precision.
T. True
F. False
Answer: True
Fisher information F(theta) = E[(d/d_theta log p(X|theta))^2] measures the curvature of the log-likelihood around theta; under standard regularity conditions it equals -E[d^2/d_theta^2 log p(X|theta)]. High curvature means small changes in theta create large changes in the likelihood: the data are sensitive to theta, allowing precise estimation. Low curvature means the likelihood is flat: the data are insensitive to theta, making estimation imprecise. The Cramer-Rao bound formalizes this: no unbiased estimator can achieve variance smaller than 1/F(theta), or 1/(n*F(theta)) for n i.i.d. observations, a fundamental limit set by the information in the data. Biased estimators are not covered by this form of the bound and can have smaller variance. The bound is attained exactly only in special cases, such as the mean parameter of an exponential family; under regularity conditions, maximum likelihood estimation attains it asymptotically.
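A small worked case may help. The sketch below assumes a Bernoulli(theta) model with n i.i.d. draws, for which Fisher information adds across observations so the bound becomes 1/(n*F(theta)); the function names and the values of theta and n are illustrative choices, not part of the quiz.

```python
def bernoulli_fisher_information(theta):
    """F(theta) = E[(d/dtheta log p(X|theta))^2] for one Bernoulli(theta) draw."""
    score_if_one = 1.0 / theta            # d/dtheta log(theta)
    score_if_zero = -1.0 / (1.0 - theta)  # d/dtheta log(1 - theta)
    # Expectation over X: P(X=1) = theta, P(X=0) = 1 - theta.
    return theta * score_if_one ** 2 + (1.0 - theta) * score_if_zero ** 2

def cramer_rao_bound(theta, n):
    """Lower bound on the variance of any unbiased estimator from n i.i.d. draws."""
    return 1.0 / (n * bernoulli_fisher_information(theta))

def sample_mean_variance(theta, n):
    """Exact variance of the sample mean of n Bernoulli(theta) draws."""
    return theta * (1.0 - theta) / n

theta, n = 0.3, 50  # illustrative values
bound = cramer_rao_bound(theta, n)
variance = sample_mean_variance(theta, n)
print(bound, variance)
```

In this model the sample mean is unbiased and its variance coincides with the bound, which is one of the special cases where the Cramer-Rao bound is met exactly rather than only asymptotically.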
Question 3 Short Answer
Explain how the error exponent in binary hypothesis testing (Neyman-Pearson setting) is related to the Kullback-Leibler divergence between the two hypotheses.
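Model answer: In binary hypothesis testing, we have null hypothesis H0 (distribution p) versus alternative H1 (distribution q). A type I error occurs when we reject H0 although the data come from p; a type II error occurs when we fail to reject H0 although the data come from q. In the Neyman-Pearson setting, the type I error probability is held at or below a fixed level alpha in (0, 1), and the smallest achievable type II error probability decays as exp(-n*D_KL(p||q)) in the sample size n (the Chernoff-Stein lemma). The error exponent is therefore exactly the KL divergence D_KL(p||q), whatever the value of alpha. When p and q are far apart (large KL divergence), errors decay rapidly; when p and q are close (small KL divergence), errors decay slowly. The optimal test is a likelihood ratio test: reject H0 when the ratio of the likelihood under q to the likelihood under p exceeds a threshold chosen to meet the level alpha (the Neyman-Pearson lemma).
A related result covers the Bayesian setting, where both error types are weighted by prior probabilities and the total error probability is minimized. There P(error) ~ exp(-n*C(p,q)), where C(p,q) = -min_{0 <= lambda <= 1} log sum_x p(x)^lambda q(x)^(1-lambda) is the Chernoff information. It is also a KL divergence: C(p,q) = D_KL(p_lambda* || p) = D_KL(p_lambda* || q), where p_lambda is proportional to p^lambda q^(1-lambda) and lambda* is the minimizing value. C(p,q) is never larger than either D_KL(p||q) or D_KL(q||p).
These results show that hypothesis testing error is limited by how much information the samples provide about which hypothesis is true, as quantified by KL divergence. No test can beat these exponents, and likelihood ratio tests achieve them.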
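To make the exponents concrete, the sketch below computes D_KL(p||q), which is the type II error exponent at fixed type I level (Chernoff-Stein lemma), and the Chernoff information in its standard form C(p,q) = -min over lambda in [0, 1] of log sum_x p(x)^lambda q(x)^(1-lambda), which is the exponent of the total error in the Bayesian setting. The two coin distributions, the grid search over lambda, and the function names are assumptions made for illustration; both distributions are assumed strictly positive.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def chernoff_information(p, q, steps=1000):
    """Approximate C(p, q) = -min_{lam in [0, 1]} log sum_x p(x)^lam * q(x)^(1 - lam)
    by evaluating the objective on an evenly spaced grid of lam values."""
    best = 0.0  # the objective is 0 at lam = 0 and lam = 1
    for i in range(steps + 1):
        lam = i / steps
        total = sum(pi ** lam * qi ** (1.0 - lam) for pi, qi in zip(p, q))
        best = min(best, math.log(total))
    return -best

p = [0.5, 0.5]  # H0: fair coin (illustrative)
q = [0.8, 0.2]  # H1: biased coin (illustrative)

stein_exponent = kl_divergence(p, q)         # Neyman-Pearson setting
bayes_exponent = chernoff_information(p, q)  # Bayesian setting
print(stein_exponent, bayes_exponent)
```

The Chernoff information comes out smaller than both D_KL(p||q) and D_KL(q||p), reflecting that driving both error types down at once is harder than driving one down while the other is merely held below a fixed level.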
Question 4 Multiple Choice
The Akaike Information Criterion (AIC) = -2*log-likelihood + 2*k penalizes model complexity by 2k. In what sense is AIC an 'information criterion'?
A. AIC measures information content of the model parameters
B. AIC estimates, up to a constant that is the same for all models, the expected KL divergence between the true distribution and the fitted model; the 2k term corrects for overfitting. It balances fit (likelihood) against complexity and is derived from information theory
C. AIC is based on Shannon entropy directly
D. AIC has no connection to information theory
Answer: B
AIC derives from information theory through the connection between MLE and KL divergence minimization. For large samples, AIC estimates 2n times the expected KL divergence between the true distribution and the fitted model, up to an additive constant that is the same for every candidate model, so ranking models by AIC approximates ranking them by KL divergence. Minimizing AIC trades off fit (the -2*log-likelihood term) against model complexity (k). The 2k term is a bias correction: when the model is approximately correct, the maximized log-likelihood overestimates the expected out-of-sample log-likelihood by about k, which becomes 2k on the -2*log-likelihood scale. AIC does not require any candidate model to be exactly true, although under substantial misspecification the 2k correction is only approximate. BIC = -2*log-likelihood + k*log(n) is a related criterion that arises from a large-sample approximation to the Bayesian marginal likelihood; it penalizes complexity more heavily than AIC whenever log(n) > 2, that is, for n >= 8.
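The two criteria are simple to compute once the maximized log-likelihood is known. The sketch below uses made-up log-likelihood values for two hypothetical models ("simple" and "complex") purely to show how the different penalties can lead to different choices; none of the numbers come from real data.

```python
import math

def aic(log_likelihood, k):
    """Akaike Information Criterion: -2*log-likelihood + 2*k."""
    return -2.0 * log_likelihood + 2.0 * k

def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: -2*log-likelihood + k*log(n)."""
    return -2.0 * log_likelihood + k * math.log(n)

n = 200  # hypothetical sample size
# Hypothetical (maximized log-likelihood, number of parameters) per model.
candidates = {"simple": (-310.0, 2), "complex": (-306.5, 5)}

aic_scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
bic_scores = {name: bic(ll, k, n) for name, (ll, k) in candidates.items()}

# Lower is better for both criteria.
best_by_aic = min(aic_scores, key=aic_scores.get)
best_by_bic = min(bic_scores, key=bic_scores.get)
print(best_by_aic, best_by_bic)
```

With these numbers the extra three parameters buy 3.5 units of log-likelihood, which is enough to offset the AIC penalty of 2 per parameter but not the BIC penalty of log(200), about 5.3 per parameter, so the two criteria disagree.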