A coin has unknown bias theta. You flip it n times. The Fisher information per flip is I(theta) = 1/(theta(1-theta)). At theta = 0.5 (fair coin), I = 4, while at theta = 0.01 (very biased), I = 1/(0.01*0.99) ≈ 101. Why is Fisher information higher for the biased coin?
A. Biased coins provide more entropy per flip
B. Each flip from a biased coin is more informative about theta because the outcome is more deterministic: a single flip from a near-certain coin strongly confirms or refutes the hypothesized bias, while a fair coin flip is ambiguous about theta
C. Fisher information is inversely related to entropy, so lower entropy means higher information
D. The formula is incorrect for extreme theta values
Answer: B
At theta = 0.01, nearly every flip lands on the same side, so the occasional opposite outcome is very surprising and sharply constrains theta. The log-likelihood changes steeply with theta near extreme values. At theta = 0.5, small changes in theta barely alter the probability of either outcome, so each flip is less informative about the precise value of theta. Fisher information measures the sensitivity of the likelihood to theta, not entropy. The Cramer-Rao bound confirms this: since I(theta) = 1/(theta(1-theta)), it gives Var(theta-hat) >= 1/(n*I(theta)) = theta(1-theta)/n, which is largest at theta = 0.5 and shrinks toward zero as theta approaches 0 or 1, so extreme biases are the easiest to estimate precisely in absolute terms.
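To make the numbers in this explanation concrete, here is a minimal Python sketch that computes the Fisher information of one flip directly from its definition as the expected squared score, and compares it with the closed form 1/(theta(1-theta)). The function name `bernoulli_fisher_information` is an illustrative choice, not part of any library, and x = 1 is assumed to be the outcome with probability theta.

```python
def bernoulli_fisher_information(theta: float) -> float:
    """Expected squared score of a single Bernoulli(theta) flip."""
    # Score = d/dtheta log f(x; theta):
    #   x = 1 -> d/dtheta log(theta)     =  1 / theta
    #   x = 0 -> d/dtheta log(1 - theta) = -1 / (1 - theta)
    score_one = 1.0 / theta
    score_zero = -1.0 / (1.0 - theta)
    # Average the squared score over the two outcomes.
    return theta * score_one**2 + (1.0 - theta) * score_zero**2


for theta in (0.5, 0.1, 0.01):
    closed_form = 1.0 / (theta * (1.0 - theta))
    print(theta, bernoulli_fisher_information(theta), closed_form)
```

The two printed columns should agree up to floating-point rounding, with the value growing as theta moves away from 0.5.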
Question 2 True / False
The Cramer-Rao bound states that no unbiased estimator can have variance lower than 1/(nI(theta)) for n independent observations.
T. True
F. False
Answer: True
For n i.i.d. observations, the total Fisher information is n*I(theta), and under the usual regularity conditions the Cramer-Rao lower bound (CRLB) on the variance of any unbiased estimator is 1/(n*I(theta)). Maximum likelihood estimators (MLEs) are asymptotically efficient: their variance approaches the CRLB as n grows. The CRLB is the fundamental limit on estimation precision: Fisher information sets a floor on the variance of every unbiased estimator.
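A quick simulation can illustrate the bound for the coin example. The sketch below is only an illustration under stated assumptions: theta = 0.3, n = 200, 5000 repetitions, and a fixed seed are arbitrary choices, and the helper names are hypothetical. It estimates theta by the sample proportion (the MLE for a coin, which is unbiased) and compares the empirical variance of that estimator with 1/(n*I(theta)) = theta(1-theta)/n.

```python
import random
import statistics


def simulate_mle_variance(theta: float, n: int, trials: int, seed: int = 0) -> float:
    """Empirical variance of the sample proportion over repeated experiments."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    estimates = []
    for _ in range(trials):
        # Count flips that land on the side with probability theta.
        successes = sum(1 for _ in range(n) if rng.random() < theta)
        estimates.append(successes / n)
    return statistics.pvariance(estimates)


def cramer_rao_bound(theta: float, n: int) -> float:
    """CRLB 1 / (n * I(theta)) with I(theta) = 1 / (theta * (1 - theta))."""
    return theta * (1.0 - theta) / n


theta, n = 0.3, 200
empirical = simulate_mle_variance(theta, n, trials=5000)
bound = cramer_rao_bound(theta, n)
print("empirical variance:", empirical)
print("Cramer-Rao bound:  ", bound)
```

For the coin, the sample proportion attains the bound exactly, so the two printed values should differ only by simulation noise.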
Question 3 Short Answer
Explain the relationship between Fisher information and KL divergence, and why this connection matters for information geometry.
Model answer: Fisher information is the second derivative of KL divergence with respect to its second argument: I(theta) = d^2/d(theta')^2 D_KL(f(x;theta) || f(x;theta')), evaluated at theta' = theta. Equivalently, for a small shift delta, D_KL(f(x;theta) || f(x;theta+delta)) ≈ (1/2) I(theta) delta^2, because the KL divergence and its first derivative both vanish at theta' = theta. KL divergence measures how different two distributions are; Fisher information measures how quickly they become different as theta changes. This makes Fisher information a local measure of distinguishability between nearby distributions. In information geometry, Fisher information serves as the Riemannian metric tensor on the manifold of probability distributions: it defines the 'distance' between infinitesimally close distributions. Geodesics on this manifold (shortest paths in the Fisher metric) correspond to natural interpolations between distributions, and the curvature of the manifold reveals the statistical structure of the model family.
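The curvature claim can be checked numerically for the coin. The following sketch, with illustrative function names and an assumed step size h = 1e-4, approximates the second derivative of the Bernoulli KL divergence in theta' by a central finite difference and compares it with 1/(theta(1-theta)).

```python
import math


def kl_bernoulli(theta: float, theta_prime: float) -> float:
    """D_KL(Bernoulli(theta) || Bernoulli(theta_prime))."""
    return (theta * math.log(theta / theta_prime)
            + (1.0 - theta) * math.log((1.0 - theta) / (1.0 - theta_prime)))


def kl_curvature(theta: float, h: float = 1e-4) -> float:
    """Central second difference of the KL divergence in theta', at theta' = theta."""
    forward = kl_bernoulli(theta, theta + h)
    center = kl_bernoulli(theta, theta)  # zero: a distribution vs. itself
    backward = kl_bernoulli(theta, theta - h)
    return (forward - 2.0 * center + backward) / h**2


for theta in (0.5, 0.3, 0.1):
    fisher = 1.0 / (theta * (1.0 - theta))
    print(theta, kl_curvature(theta), fisher)
```

The finite-difference curvature and the closed-form Fisher information should agree to several decimal places; the approximation degrades if theta is within a few multiples of h of 0 or 1.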
For vector parameters, the Fisher information matrix generalizes this to multiple parameters: I_{ij}(theta) = E[(d/d_theta_i log f)(d/d_theta_j log f)]. This matrix is always positive semi-definite, and it is positive-definite for regular, identifiable models, in which case it defines a Riemannian metric that makes the space of distributions a curved manifold. This is the foundation of information geometry.
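As a small worked instance of the matrix formula, consider a hypothetical three-outcome generalization of the coin with free parameters (p1, p2) and p3 = 1 - p1 - p2. This model is an assumption introduced here for illustration only. The sketch computes I_{ij} exactly by averaging products of score components over the three outcomes, then checks positive-definiteness of the resulting 2x2 matrix through its leading entry and determinant.

```python
def categorical_fisher_matrix(p1: float, p2: float) -> list[list[float]]:
    """Fisher information matrix for one draw from a 3-outcome categorical
    distribution parameterized by (p1, p2), with p3 = 1 - p1 - p2."""
    p3 = 1.0 - p1 - p2
    probs = (p1, p2, p3)
    # Score vectors (d/dp1 log f, d/dp2 log f) for each outcome.
    scores = (
        (1.0 / p1, 0.0),         # outcome 1: log f = log(p1)
        (0.0, 1.0 / p2),         # outcome 2: log f = log(p2)
        (-1.0 / p3, -1.0 / p3),  # outcome 3: log f = log(1 - p1 - p2)
    )
    matrix = [[0.0, 0.0], [0.0, 0.0]]
    # I_ij = sum over outcomes of P(outcome) * score_i * score_j.
    for prob, score in zip(probs, scores):
        for i in range(2):
            for j in range(2):
                matrix[i][j] += prob * score[i] * score[j]
    return matrix


fim = categorical_fisher_matrix(0.2, 0.3)
determinant = fim[0][0] * fim[1][1] - fim[0][1] * fim[1][0]
is_positive_definite = fim[0][0] > 0 and determinant > 0
print(fim, determinant, is_positive_definite)
```

Working the expectation by hand gives I_{ij} = delta_{ij}/p_i + 1/p3, so the off-diagonal entries are nonzero: the two parameters are coupled through the shared constraint on p3.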