Why is the Fisher information matrix the natural choice of Riemannian metric on a statistical manifold, rather than the Euclidean metric on parameters?
A. The Euclidean metric is not defined for probability distributions
B. The Fisher metric is invariant under reparameterization: changing coordinates (e.g., from probability p to log-odds) does not change the geometric structure, while the Euclidean metric depends on the arbitrary choice of parameterization
C. The Fisher metric is always positive definite, while the Euclidean metric is not
D. The Euclidean metric requires distributions to have the same support
Čencov's (Chentsov's) theorem shows that the Fisher information metric is the unique (up to scaling) Riemannian metric that is invariant under sufficient statistics (more generally, under congruent Markov morphisms). Because the Fisher information matrix transforms as a tensor under a change of coordinates, the geometry it defines does not depend on how you parameterize the family: distance between distributions is an intrinsic property. The Euclidean distance d(theta_1, theta_2) = ||theta_1 - theta_2||, by contrast, depends on the parameterization: distributions that are 'close' in one parameterization may be 'far' in another. The Fisher metric avoids this arbitrariness.
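To make the invariance concrete, here is a minimal sketch for the Bernoulli family, assuming the standard Fisher information 1/(p(1-p)) in the probability coordinate p and p(1-p) in the log-odds coordinate. The helper names (`fisher_length`, `fisher_in_p`, `fisher_in_eta`) and the two example probabilities are illustrative choices, not part of the quiz. The sketch integrates the Fisher line element numerically in both coordinate systems and compares the result with the Euclidean distance in each.

```python
import math

def sigmoid(eta):
    """Map log-odds eta to probability p."""
    return 1.0 / (1.0 + math.exp(-eta))

def logit(p):
    """Map probability p to log-odds eta."""
    return math.log(p / (1.0 - p))

def fisher_in_p(p):
    # Fisher information of Bernoulli(p) in the p coordinate.
    return 1.0 / (p * (1.0 - p))

def fisher_in_eta(eta):
    # Fisher information of the same family in the log-odds coordinate.
    p = sigmoid(eta)
    return p * (1.0 - p)

def fisher_length(metric, start, end, steps=20_000):
    # Midpoint-rule approximation of the integral of sqrt(g(x)) dx,
    # i.e. the Fisher length of the path from start to end.
    h = (end - start) / steps
    total = sum(math.sqrt(metric(start + (i + 0.5) * h)) for i in range(steps))
    return abs(total * h)

# Two example distributions (arbitrary illustrative values).
p1, p2 = 0.5, 0.9

# Euclidean distance depends on which coordinate we measure in.
euclid_p = abs(p2 - p1)
euclid_eta = abs(logit(p2) - logit(p1))

# Fisher length is the same in both coordinates.
fisher_p = fisher_length(fisher_in_p, p1, p2)
fisher_eta = fisher_length(fisher_in_eta, logit(p1), logit(p2))

# Closed form for the Bernoulli family: 2 * |asin(sqrt(p2)) - asin(sqrt(p1))|.
closed_form = 2.0 * abs(math.asin(math.sqrt(p2)) - math.asin(math.sqrt(p1)))

print("Euclidean in p:       ", euclid_p)
print("Euclidean in log-odds:", euclid_eta)
print("Fisher in p:          ", fisher_p)
print("Fisher in log-odds:   ", fisher_eta)
print("Fisher closed form:   ", closed_form)
```

The two Euclidean numbers disagree because they measure the same pair of distributions in different coordinates, while the two Fisher lengths agree with each other and with the closed form up to numerical integration error.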
Question 2 True / False
In information geometry, exponential families are flat manifolds under the e-connection, and mixture families are flat under the m-connection.
T. True
F. False
Answer: True
This dually flat structure is a central result of information geometry. Exponential families (Gaussian, Bernoulli, Poisson, etc.) have zero curvature under the e-connection, meaning e-geodesics (exponential interpolations between distributions) are straight lines in natural parameters. Mixture families (convex combinations of distributions) are flat under the m-connection, so m-geodesics are straight lines in the mixture weights. For an exponential family, the KL divergence D_KL(p||q) is the Bregman divergence of the log-partition function in natural parameters, and it decomposes along projections taken with respect to the two dual connections. This duality underlies the geometric account of why EM converges: in Amari's em formulation, the algorithm alternates e-projections and m-projections, and neither step increases the KL divergence.
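As a small numerical check of these two claims, the following sketch uses the Bernoulli family with natural parameter eta (the log-odds) and log-partition function psi(eta) = log(1 + e^eta). It compares KL computed directly with the Bregman form psi(eta_q) - psi(eta_p) - (eta_q - eta_p) * psi'(eta_p), and confirms that the midpoint of the straight line in natural parameters is the normalized geometric mean of the two distributions. The function names and the two example parameter values are hypothetical.

```python
import math

def log_partition(eta):
    # psi(eta) = log(1 + exp(eta)) for the Bernoulli family.
    return math.log1p(math.exp(eta))

def mean_from_eta(eta):
    # psi'(eta): the expectation parameter p = P(x = 1).
    return 1.0 / (1.0 + math.exp(-eta))

def kl_direct(p, q):
    # D_KL(Bernoulli(p) || Bernoulli(q)) from the definition.
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def kl_bregman(eta_p, eta_q):
    # Bregman divergence of psi, evaluated at (eta_q, eta_p).
    return (log_partition(eta_q) - log_partition(eta_p)
            - (eta_q - eta_p) * mean_from_eta(eta_p))

# Two example distributions given by natural parameters (arbitrary values).
eta_p, eta_q = -0.3, 1.2
p, q = mean_from_eta(eta_p), mean_from_eta(eta_q)

direct = kl_direct(p, q)
bregman = kl_bregman(eta_p, eta_q)

# e-geodesic midpoint: a straight line in eta ...
mid_from_eta = mean_from_eta(0.5 * (eta_p + eta_q))
# ... equals the normalized geometric mean of the two distributions.
g1 = math.sqrt(p * q)
g0 = math.sqrt((1.0 - p) * (1.0 - q))
mid_geometric = g1 / (g1 + g0)

print("KL direct: ", direct)
print("KL Bregman:", bregman)
print("e-midpoint via eta:           ", mid_from_eta)
print("e-midpoint via geometric mean:", mid_geometric)
```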
Question 3 Short Answer
Explain the Pythagorean theorem in information geometry and how it relates to the projection properties of maximum likelihood estimation.
Model answer: In a dually flat manifold, the generalized Pythagorean theorem states: let M be an e-flat submanifold (for example, an exponential family), let p be a distribution that need not lie in M, and let q be the m-projection (mixture projection) of p onto M, that is, the point of M minimizing D_KL(p || q). Then D_KL(p || r) = D_KL(p || q) + D_KL(q || r) for any r in M. This is analogous to ||p-r||^2 = ||p-q||^2 + ||q-r||^2 in Euclidean geometry, where q is the orthogonal projection. MLE is an m-projection: the maximum likelihood distribution is the point in the model family closest to the empirical distribution in KL divergence, with the empirical distribution as the first argument. The Pythagorean theorem guarantees that this projection decomposes the total KL divergence into 'model error' (p to q) and 'within-model distance' (q to r), a decomposition that underlies geometric treatments of model selection and goodness-of-fit testing.
The Pythagorean relation requires that q be the m-projection of p and that r lie in the e-flat submanifold; this is the 'right angle' condition, in which the m-geodesic from p to q meets the submanifold orthogonally with respect to the Fisher metric. This generalizes the familiar orthogonal decomposition in linear regression to the nonlinear setting of exponential families, providing geometric insight into why maximum likelihood has good properties.
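The Pythagorean relation can be checked numerically on a small example. The sketch below assumes a joint distribution p over two binary variables and takes the submanifold to be the independence model (all product distributions), which is an exponential family and hence e-flat. The m-projection of p onto this model is the product of p's marginals. The specific probabilities and the helper names `kl` and `product` are illustrative assumptions.

```python
import math

def kl(p, q):
    # D_KL(p || q) for discrete distributions given as equal-length lists.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def product(a, b):
    # Independent joint with P(x=1) = a and P(y=1) = b,
    # ordered as (x, y) = (0,0), (0,1), (1,0), (1,1).
    return [(1 - a) * (1 - b), (1 - a) * b, a * (1 - b), a * b]

# A joint distribution with dependence between x and y (arbitrary values).
p = [0.4, 0.1, 0.2, 0.3]

# m-projection onto the independence model: keep the marginals of p.
a_marg = p[2] + p[3]   # P(x = 1)
b_marg = p[1] + p[3]   # P(y = 1)
q = product(a_marg, b_marg)

# Any other point r in the independence model.
r = product(0.7, 0.2)

lhs = kl(p, r)               # total divergence from p to r
rhs = kl(p, q) + kl(q, r)    # 'model error' plus 'within-model distance'

# q should also minimize D_KL(p || .) over a grid of product distributions.
grid = [i / 20 for i in range(1, 20)]
best_on_grid = min(kl(p, product(a, b)) for a in grid for b in grid)

print("D(p||r):          ", lhs)
print("D(p||q) + D(q||r):", rhs)
print("D(p||q):          ", kl(p, q))
print("best on grid:     ", best_on_grid)
```

Here D(p||q) is the mutual information between x and y, and the equality holds for every product distribution r, not only the one chosen above.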