Information theory provides fundamental limits and optimality principles for statistical inference, covering both estimation and hypothesis testing. The Kullback-Leibler divergence D_KL(p||q) quantifies the information lost when approximating the true distribution p with a model q, and arises naturally in maximum likelihood estimation, whose asymptotic objective is to minimize this divergence. The Fisher information I(theta) quantifies the curvature of the log-likelihood landscape, and its inverse lower-bounds the variance of any unbiased estimator (Cramer-Rao bound). Hypothesis testing can be framed information-theoretically: the error probability decays exponentially with the number of samples, at a rate given by the Chernoff exponent, which is built from KL divergences involving the competing hypotheses. Information criteria (AIC, BIC, MDL) trade off model fit against complexity using KL divergence or description length. These principles place estimation, testing, and model selection under a single information-theoretic framework, in which every statistical task is limited by how much information the data provide about the unknowns.
Statistical inference (estimating unknown parameters and testing hypotheses from data) may appear to have little to do with information theory. Yet information-theoretic concepts and bounds are fundamental to understanding what inference is possible and how well it can be done.
Maximum Likelihood and KL Divergence:
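The most common approach to estimation is maximum likelihood: given observed data, find the parameter theta that maximizes the probability of the data. Asymptotically (as the sample size grows), maximum likelihood estimation (MLE) is equivalent to finding the theta that minimizes the Kullback-Leibler divergence D_KL(p_true || p_theta) between the true distribution and the model, because the average log-likelihood converges to -D_KL(p_true || p_theta) minus the entropy of p_true, and the entropy does not depend on theta. This connection places estimation under the principle of KL divergence minimization. The likelihood function, the score (the gradient of the log-likelihood), and the Fisher information (the expected negative Hessian of the log-likelihood) all emerge naturally from divergence-theoretic concepts. Under standard regularity conditions and a correctly specified model, MLE is asymptotically optimal in several senses: it is consistent (converges to the true parameter) and efficient (achieves the Cramer-Rao lower bound asymptotically). Under misspecification, it converges instead to the parameter whose model is closest to the truth in KL divergence.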
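The link between likelihood and KL divergence can be checked numerically. The sketch below is a minimal illustration, not a general estimator: it assumes a Bernoulli model, a small made-up dataset, and a grid search over theta. For a finite sample, maximizing the average log-likelihood is the same as minimizing the KL divergence from the empirical distribution to the model, and the empirical distribution approaches the true one as the sample grows. The function names are hypothetical.

```python
import math

def kl_bernoulli(p, q):
    """D_KL(Bernoulli(p) || Bernoulli(q)) in nats, for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def avg_log_likelihood(data, q):
    """Average log-likelihood of 0/1 data under Bernoulli(q)."""
    k = sum(data)
    n = len(data)
    return (k * math.log(q) + (n - k) * math.log(1 - q)) / n

# Made-up data for illustration: 7 ones in 10 trials.
data = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
p_hat = sum(data) / len(data)  # empirical frequency of ones

# Candidate parameter values 0.01, 0.02, ..., 0.99.
grid = [i / 100 for i in range(1, 100)]

# Maximizing likelihood and minimizing KL from the empirical
# distribution pick out the same grid point.
theta_mle = max(grid, key=lambda q: avg_log_likelihood(data, q))
theta_kl = min(grid, key=lambda q: kl_bernoulli(p_hat, q))

print(theta_mle, theta_kl)
```

Both searches land on the empirical frequency, because the average log-likelihood differs from the negative KL divergence only by a term that does not depend on theta.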
Fisher Information and Estimation Limits:
The Fisher information I(theta) = E[(d/d_theta log p(X|theta))^2] quantifies how sensitive the log-likelihood is to changes in theta. High Fisher information means the data are informative about theta, enabling precise estimation. The Cramer-Rao bound states that no unbiased estimator based on n independent samples can have variance lower than 1/(n*I(theta)), where I(theta) is the Fisher information per sample. This is a fundamental limit: the amount of information in the data (quantified by Fisher information) directly constrains estimation precision. Under standard regularity conditions, MLE achieves the Cramer-Rao bound asymptotically, so it is not just practical but asymptotically efficient.
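As a small worked case, the sketch below evaluates the Fisher information of a single Bernoulli(theta) observation directly from the definition above and compares the resulting bound for n independent samples with the variance of the sample mean, which is an unbiased estimator of theta. The Bernoulli model and the values of theta and n are assumptions chosen for illustration.

```python
import math

def fisher_information_bernoulli(theta):
    """I(theta) = E[(d/dtheta log p(X|theta))^2] for one Bernoulli(theta) draw."""
    score_if_one = 1.0 / theta             # d/dtheta log(theta)
    score_if_zero = -1.0 / (1.0 - theta)   # d/dtheta log(1 - theta)
    return theta * score_if_one**2 + (1.0 - theta) * score_if_zero**2

def cramer_rao_bound(theta, n):
    """Lower bound on the variance of any unbiased estimator from n i.i.d. draws."""
    return 1.0 / (n * fisher_information_bernoulli(theta))

def variance_of_sample_mean(theta, n):
    """Exact variance of the sample mean of n Bernoulli(theta) draws."""
    return theta * (1.0 - theta) / n

theta, n = 0.3, 50  # illustrative values
bound = cramer_rao_bound(theta, n)
actual = variance_of_sample_mean(theta, n)
print(bound, actual)
```

In this model the sample mean meets the bound exactly, since I(theta) = 1/(theta*(1-theta)); in most models the bound is only approached as n grows.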
Hypothesis Testing and Error Exponents:
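In the Neyman-Pearson setting, we test H0 (data drawn from p) versus H1 (data drawn from q). The optimal test is the likelihood ratio test: accept H1 if the likelihood ratio q(data)/p(data) exceeds a threshold. Error probabilities decay exponentially with sample size n. If the false-alarm probability is held at a fixed level, the probability of missing H1 decays as exp(-n*D_KL(p||q)) (the Chernoff-Stein lemma). If both error types are weighted by prior probabilities, the total error decays as P(error) ~ exp(-n*E*), where E* is the Chernoff information. For simple hypotheses, E* = -min_{0<=beta<=1} log sum_x p(x)^beta q(x)^(1-beta); equivalently, E* = D_KL(r||p) = D_KL(r||q), where r is the distribution proportional to p^beta q^(1-beta) at the minimizing beta. When p and q are far apart (large KL divergence in both directions), E* is large: errors vanish quickly and discrimination is strong. When p and q are close, E* is small and discrimination is weak. The rate of error decay is thus set by KL divergences tied to the two hypotheses, which measure their information-theoretic separation.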
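To make the error exponent concrete, the sketch below approximates the Chernoff information of two discrete distributions using its standard definition, E* = -min over beta in [0, 1] of log sum_x p(x)^beta q(x)^(1-beta), via a simple grid search over beta. The two distributions are made up for illustration, both are assumed to have strictly positive probabilities, and the grid resolution is an arbitrary choice.

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats for discrete distributions with q(x) > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def chernoff_information(p, q, steps=1000):
    """Grid-search approximation of -min_beta log sum_x p(x)^beta q(x)^(1-beta)."""
    best = 0.0
    for i in range(steps + 1):
        beta = i / steps
        s = sum(pi**beta * qi**(1 - beta) for pi, qi in zip(p, q))
        best = max(best, -math.log(s))
    return best

p = [0.5, 0.5]  # hypothetical H0
q = [0.9, 0.1]  # hypothetical H1

c = chernoff_information(p, q)
print(c, kl(p, q), kl(q, p))
```

The Chernoff information is symmetric in p and q and never exceeds the smaller of the two KL divergences, which is one way to see that the exponent for the total error is smaller than the one-sided exponent available when only one error type has to vanish.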
Information Criteria and Model Selection:
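Comparing models that may be misspecified requires balancing likelihood and complexity. Three major information criteria emerge from information-theoretic principles. Writing L for the maximized likelihood, k for the number of free parameters, and n for the sample size:
AIC (Akaike information criterion): AIC = 2k - 2 log L.
BIC (Bayesian information criterion): BIC = k log(n) - 2 log L.
MDL (minimum description length): choose the model that minimizes the total code length needed to describe the model plus the data given the model.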
Each reflects a different information-theoretic principle: AIC estimates the expected KL divergence between the truth and the fitted model on future data; BIC approximates the Bayesian marginal likelihood and is consistent, selecting the true model asymptotically when it is among the candidates; MDL directly encodes simplicity as short description length. For finite samples or under model misspecification, they can give different answers. Understanding the information-theoretic foundations helps choose the right criterion for a given problem.
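The disagreement between criteria is easy to reproduce with the standard formulas AIC = 2k - 2 log L and BIC = k log(n) - 2 log L, where L is the maximized likelihood, k the number of free parameters, and n the sample size. In the sketch below the two candidate models and their log-likelihood values are invented for illustration, not fitted to any data.

```python
import math

def aic(log_likelihood, k):
    """Akaike information criterion (lower is better)."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian information criterion (lower is better)."""
    return k * math.log(n) - 2 * log_likelihood

n = 100  # hypothetical sample size

# Hypothetical fits: name -> (maximized log-likelihood, parameter count).
models = {
    "simple": (-120.0, 2),
    "complex": (-116.0, 5),
}

aic_scores = {name: aic(ll, k) for name, (ll, k) in models.items()}
bic_scores = {name: bic(ll, k, n) for name, (ll, k) in models.items()}

best_by_aic = min(aic_scores, key=aic_scores.get)
best_by_bic = min(bic_scores, key=bic_scores.get)
print(best_by_aic, best_by_bic)
```

With these numbers AIC prefers the larger model and BIC the smaller one, because BIC's penalty per parameter, log(n), exceeds AIC's penalty of 2 once n is larger than about 7.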
Convergence and Sample Complexity:
How many samples do you need to estimate a parameter to a given precision? Specifying a scalar parameter to precision epsilon takes about log(1/epsilon) bits, but n independent samples from a smooth model carry only about (1/2) log(n) bits about it, so n must grow on the order of 1/epsilon^2. More precisely, the sample complexity depends on the Fisher information: by the Cramer-Rao bound, an unbiased estimator needs n >= 1/(I(theta)*epsilon^2) samples to reach standard deviation epsilon, so lower information (a flatter likelihood landscape) requires more samples. This is also why high-dimensional problems are hard: with d parameters, you need at least on the order of d samples just to identify them, and practical estimation requires more.
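The dependence of sample size on Fisher information can be turned into a one-line calculation. The sketch below inverts the Cramer-Rao bound, Var >= 1/(n*I(theta)), to get the smallest n for which an unbiased estimator could reach a target standard deviation epsilon. The example assumes a Gaussian mean with known noise standard deviation sigma, for which the per-sample Fisher information is 1/sigma^2; the function name and numbers are illustrative.

```python
import math

def min_samples(fisher_info, epsilon):
    """Smallest n with 1 / (n * fisher_info) <= epsilon**2."""
    return math.ceil(1.0 / (fisher_info * epsilon**2))

sigma = 2.0                      # assumed known noise standard deviation
fisher_info = 1.0 / sigma**2     # Fisher information of a Gaussian mean, per sample

n_coarse = min_samples(fisher_info, 0.5)
n_fine = min_samples(fisher_info, 0.25)
print(n_coarse, n_fine)
```

Halving epsilon quadruples the required sample size, and halving the Fisher information doubles it.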
Learning Theory Connection:
Information theory also connects to learning theory (generalization bounds, PAC learning). The logarithm of the number of hypotheses that can be reliably distinguished from a limited dataset is bounded, via Fano's inequality, by the mutual information between the hypothesis and the observed data. This information-theoretic view relates estimation error (approximation quality) and generalization error (performance on unseen data). For a d-dimensional exponential family, the information that n samples carry about the parameters grows only as about (d/2) log(n), so each parameter can be resolved to a precision of roughly 1/sqrt(n), and reliable estimation requires n to be large relative to d.
Information theory reveals that estimation, testing, and model selection are all fundamentally constrained by how much information the data provide about unknowns. Viewed through this lens, statistical inference is unified: every task is subject to information-theoretic bounds, and many optimal methods can be read as minimizing KL divergence or maximizing mutual information. This perspective has practical implications: it guides algorithm design, explains why certain methods work well, and reveals fundamental limits that no method can overcome.
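How slowly information accumulates with sample size can be seen in a case with a closed form. The sketch below assumes a Gaussian location model that is not part of the text above: a scalar parameter with a Gaussian prior of variance prior_var, observed through n independent Gaussian samples with variance noise_var. In that model the mutual information between the parameter and the data is (1/2) log(1 + n*prior_var/noise_var) nats, which grows like (1/2) log(n); with d independent coordinates of this kind the total would be d times as large.

```python
import math

def gaussian_mutual_information(n, prior_var=1.0, noise_var=1.0):
    """I(theta; X_1..X_n) in nats for theta ~ N(0, prior_var), X_i ~ N(theta, noise_var)."""
    return 0.5 * math.log(1.0 + n * prior_var / noise_var)

for n in (10, 100, 1000, 10000):
    print(n, round(gaussian_mutual_information(n), 3))

# Going from 100 to 10000 samples adds only about 0.5 * log(100) nats.
gain = gaussian_mutual_information(10000) - gaussian_mutual_information(100)
```

Each hundredfold increase in n adds roughly the same fixed amount of information, which is the logarithmic growth behind the 1/sqrt(n) resolution per parameter.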