Information geometry treats families of probability distributions as points on a Riemannian manifold, with the Fisher information matrix as the metric tensor. From this perspective, much of statistical inference is a geometric operation: maximum likelihood estimation projects the empirical distribution onto the model manifold by minimizing KL divergence, the EM algorithm can be read as alternating between two dual projections, and exponential families are flat submanifolds with respect to the e-connection. The dual connection structure (e-connection and m-connection) developed by Amari expresses the asymmetry of KL divergence geometrically. Information geometry thereby connects statistics, information theory, and differential geometry, and gives structural insight into optimization, machine learning, and neural networks.
Probability distributions have a natural geometric structure. Consider the set of all Bernoulli distributions parameterized by p in (0,1). In the usual Euclidean view, this is a line segment on which equal steps in p count as equal distances. From an information-theoretic perspective, however, distributions near p = 0 or p = 1 are farther apart than the Euclidean picture suggests: a small change in p near the boundary changes the distribution much more than the same change near p = 1/2. Moving from p = 0.01 to p = 0.02 doubles the probability of success, whereas moving from p = 0.50 to p = 0.51 is barely detectable from data. The Fisher information I(p) = 1/(p(1-p)) captures this: the infinitesimal "information-theoretic distance" between p and p + dp is ds = sqrt(I(p)) * dp, which grows without bound as p approaches 0 or 1. Information geometry makes this rigorous by using I(p) as a Riemannian metric.
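The Bernoulli example can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, and it relies on the fact that integrating sqrt(I(p)) = 1/sqrt(p(1-p)) gives the closed form 2*asin(sqrt(p)). It compares the length of the same Euclidean step of 0.01 near the boundary and near the centre, and cross-checks the closed form against a simple midpoint-rule integration.

```python
import math


def fisher_information(p: float) -> float:
    """Fisher information of a Bernoulli(p) distribution, for 0 < p < 1."""
    return 1.0 / (p * (1.0 - p))


def fisher_rao_distance(p: float, q: float) -> float:
    """Integral of sqrt(I(t)) dt between p and q, in closed form.

    The antiderivative of 1/sqrt(t(1-t)) is 2*asin(sqrt(t)).
    """
    return 2.0 * abs(math.asin(math.sqrt(q)) - math.asin(math.sqrt(p)))


def numeric_distance(p: float, q: float, steps: int = 10_000) -> float:
    """Midpoint-rule approximation of the same integral, as a cross-check."""
    lo, hi = min(p, q), max(p, q)
    h = (hi - lo) / steps
    return h * sum(
        math.sqrt(fisher_information(lo + (k + 0.5) * h)) for k in range(steps)
    )


# The same Euclidean step of 0.01 is longer near the boundary than in the middle.
near_boundary = fisher_rao_distance(0.01, 0.02)
near_centre = fisher_rao_distance(0.50, 0.51)
print("near boundary:", near_boundary)
print("near centre:  ", near_centre)
```

Near p = 1/2 the result stays close to the naive estimate sqrt(I(0.5)) * 0.01 = 0.02, while near the boundary the same step in p is several times longer.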
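For a parametric family {f(x; theta) : theta in Theta}, the Fisher information matrix g_{ij}(theta) = E[(d/d_theta_i log f)(d/d_theta_j log f)], with the expectation taken under f(x; theta), serves as the metric tensor. The squared infinitesimal distance between nearby distributions f(x; theta) and f(x; theta + d_theta) is ds^2 = sum_{i,j} g_{ij} d_theta_i d_theta_j; the geodesic distance between two distributions is the length of the shortest path between them measured with this line element. Like any Riemannian line element, ds^2 is invariant under reparameterization. More strongly, by Cencov's (Chentsov's) theorem, the Fisher metric is, up to a constant factor, the only Riemannian metric on statistical manifolds that is invariant under sufficient statistics, which makes it the natural choice. Two distributions that look close in one parameterization but far apart in another are therefore assigned the same Fisher distance in both.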
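To make the definition of g_{ij} concrete, the following sketch approximates the Fisher information matrix of a univariate Gaussian in the parameters (mu, sigma) by numerically integrating the outer product of the score, and then evaluates the line element ds^2. The choice of family, the function names, and the integration settings (`steps`, `width`) are assumptions made for illustration; the result can be compared against the known closed form diag(1/sigma^2, 2/sigma^2).

```python
import math


def gaussian_pdf(x: float, mu: float, sigma: float) -> float:
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def score(x: float, mu: float, sigma: float) -> tuple:
    """Gradient of log f(x; mu, sigma) with respect to (mu, sigma)."""
    d_mu = (x - mu) / sigma**2
    d_sigma = -1.0 / sigma + (x - mu) ** 2 / sigma**3
    return (d_mu, d_sigma)


def fisher_matrix(mu: float, sigma: float, steps: int = 20_000, width: float = 12.0):
    """Approximate g_ij = E[score_i * score_j] with a midpoint rule
    on the interval [mu - width*sigma, mu + width*sigma]."""
    lo = mu - width * sigma
    h = 2.0 * width * sigma / steps
    g = [[0.0, 0.0], [0.0, 0.0]]
    for k in range(steps):
        x = lo + (k + 0.5) * h
        s = score(x, mu, sigma)
        w = gaussian_pdf(x, mu, sigma) * h  # probability mass of this cell
        for i in range(2):
            for j in range(2):
                g[i][j] += w * s[i] * s[j]
    return g


def line_element_sq(g, d_theta) -> float:
    """ds^2 = sum_ij g_ij * d_theta_i * d_theta_j."""
    return sum(
        g[i][j] * d_theta[i] * d_theta[j] for i in range(2) for j in range(2)
    )


g = fisher_matrix(mu=1.0, sigma=2.0)
print(g)  # compare with the closed form diag(1/sigma^2, 2/sigma^2)
print(line_element_sq(g, (0.01, 0.01)))
```

The off-diagonal entries vanish here because mu and sigma are orthogonal coordinates for the Gaussian family; in other parameterizations the matrix is generally not diagonal.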
The deepest insight of information geometry is the dual connection structure. On a standard Riemannian manifold, the metric singles out one natural connection, the Levi-Civita connection. On a statistical manifold, two further connections play the central role: the e-connection (exponential) and the m-connection (mixture). They are dual with respect to the Fisher metric, meaning that inner products are preserved when one vector is parallel-transported by the e-connection and the other by the m-connection, and the Levi-Civita connection is their average. Exponential families are flat under the e-connection, meaning their natural parameters form a coordinate system in which e-geodesics are straight lines. Mixture families are flat under the m-connection, with the mixture weights as the corresponding affine coordinates. KL divergence is the canonical divergence of this dually flat structure, and its asymmetry (D_KL(p||q) != D_KL(q||p)) reflects the fact that the two connections are distinct.
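A small sketch can illustrate these statements on the Bernoulli family, which is both an exponential family (natural parameter theta = log(p/(1-p))) and a mixture family (parameter p). Under those assumptions, the code shows the asymmetry of KL divergence, checks that KL divergence coincides with the Bregman divergence of the log-partition function psi(theta) = log(1 + e^theta), and compares the midpoints of the e-geodesic (straight in theta) and the m-geodesic (straight in p). All function names are hypothetical.

```python
import math


def kl_bernoulli(p: float, q: float) -> float:
    """D_KL(Bernoulli(p) || Bernoulli(q))."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def natural_param(p: float) -> float:
    """Natural (e-affine) coordinate theta = log(p / (1 - p))."""
    return math.log(p / (1.0 - p))


def log_partition(theta: float) -> float:
    """psi(theta) = log(1 + exp(theta)); its derivative is p."""
    return math.log1p(math.exp(theta))


def bregman_kl(p: float, q: float) -> float:
    """psi(theta_q) - psi(theta_p) - psi'(theta_p) * (theta_q - theta_p),
    which equals D_KL(p || q) for an exponential family."""
    tp, tq = natural_param(p), natural_param(q)
    return log_partition(tq) - log_partition(tp) - p * (tq - tp)


def e_geodesic(p: float, q: float, t: float) -> float:
    """Straight line in the natural parameter theta, mapped back to p."""
    theta = (1.0 - t) * natural_param(p) + t * natural_param(q)
    return 1.0 / (1.0 + math.exp(-theta))


def m_geodesic(p: float, q: float, t: float) -> float:
    """Straight line in the mixture (expectation) parameter p."""
    return (1.0 - t) * p + t * q


print(kl_bernoulli(0.1, 0.5), kl_bernoulli(0.5, 0.1))  # the two directions differ
print(bregman_kl(0.1, 0.5))                            # matches the first value
print(e_geodesic(0.1, 0.5, 0.5), m_geodesic(0.1, 0.5, 0.5))
```

In one dimension both geodesics trace the same interval, but they move along it at different speeds, so their midpoints differ.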
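This geometric framework illuminates algorithms. In Amari's formulation, the EM algorithm alternates between an e-projection and an m-projection (the "em" algorithm, which coincides with EM in many common cases). Neither projection can increase the KL divergence between the data manifold and the model manifold, which is why the likelihood never decreases from one iteration to the next. Natural gradient descent (used in training neural networks) preconditions the gradient with the inverse Fisher information matrix, so each step is the steepest-descent direction measured in the Fisher metric rather than in the Euclidean metric of parameter space. The resulting update direction does not depend on the choice of parameterization and often converges faster. Variational inference minimizes the KL divergence D_KL(q||p) from an approximating distribution q to the target p over a tractable family, which is a projection in the information-geometric sense. The field, developed primarily by Shun-ichi Amari, continues to provide structural insights into machine learning, optimization, and the foundations of statistics.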
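As a toy illustration of the natural gradient, the sketch below fits a Bernoulli distribution to a target mean m by minimizing the negative log-likelihood in the logit parameter theta. It is a minimal example under stated assumptions, not a neural-network training recipe: the function name, starting point, learning rate, and tolerance are all hypothetical choices. The only difference between the two runs is whether the gradient is divided by the Fisher information I(theta) = p(1-p).

```python
import math


def sigmoid(theta: float) -> float:
    return 1.0 / (1.0 + math.exp(-theta))


def steps_to_converge(m: float, p0: float, lr: float, natural: bool,
                      tol: float = 1e-3, max_iter: int = 10_000) -> int:
    """Minimize L(theta) = -(m*log p + (1-m)*log(1-p)) with p = sigmoid(theta).

    Returns the number of updates until |p - m| < tol, or max_iter if the
    tolerance is never reached.
    """
    theta = math.log(p0 / (1.0 - p0))
    for n in range(max_iter):
        p = sigmoid(theta)
        if abs(p - m) < tol:
            return n
        grad = p - m                  # dL/dtheta
        if natural:
            grad /= p * (1.0 - p)     # multiply by the inverse Fisher information
        theta -= lr * grad
    return max_iter


plain_steps = steps_to_converge(m=0.7, p0=0.01, lr=0.1, natural=False)
natural_steps = steps_to_converge(m=0.7, p0=0.01, lr=0.1, natural=True)
print("plain gradient steps:  ", plain_steps)
print("natural gradient steps:", natural_steps)
```

In this setting the plain gradient is small when p starts near the boundary, so progress is slow, while the Fisher preconditioning rescales the step to compensate. The comparison is specific to this example; in higher dimensions the natural gradient requires estimating and inverting the Fisher matrix, which is typically approximated in practice.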