The alpha-connection with alpha=0 is the Levi-Civita connection of the Fisher metric (the metric connection, which is self-dual but not flat in general). How do exponential and mixture families sit relative to this symmetric connection?
A. Exponential and mixture families are both flat under the Levi-Civita connection
B. Exponential families are e-flat (alpha=+1), mixture families are m-flat (alpha=-1), and the Levi-Civita connection (alpha=0) is equidistant between them in a geometric sense
C. Only exponential families have geometric structure; mixture families are generic
D. The choice of alpha does not affect the flatness of either family
The alpha-connection family interpolates: the e-connection (alpha=+1) makes exponential families flat (zero curvature), and the m-connection (alpha=-1) makes mixture families flat. The Levi-Civita connection (alpha=0) is the 'middle' connection, the average of the two duals. This dual structure is characteristic of information geometry: a generic Riemannian manifold comes with only one canonical connection, the Levi-Civita connection. The duality explains many phenomena: the KL divergence, while asymmetric, decomposes nicely due to the existence of two dual flat structures. When a family is flat in one connection, geodesics of that connection are straight lines in the corresponding affine coordinates (natural parameters for the e-connection, expectation or mixture parameters for the m-connection).
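The two kinds of "straight line" described above can be illustrated with a minimal sketch on categorical distributions. The function names `m_geodesic` and `e_geodesic` and the example vectors `p` and `q` are hypothetical, chosen only for illustration: an m-geodesic interpolates probabilities linearly, while an e-geodesic interpolates log-probabilities linearly and then renormalizes.

```python
import math

def m_geodesic(p, q, t):
    """Mixture (m-) geodesic: linear interpolation of the probabilities."""
    return [(1 - t) * pi + t * qi for pi, qi in zip(p, q)]

def e_geodesic(p, q, t):
    """Exponential (e-) geodesic: linear interpolation of the
    log-probabilities, followed by renormalization."""
    unnorm = [math.exp((1 - t) * math.log(pi) + t * math.log(qi))
              for pi, qi in zip(p, q)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Two illustrative distributions on three outcomes (made-up numbers).
p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]

mid_m = m_geodesic(p, q, 0.5)  # midpoint along the m-geodesic
mid_e = e_geodesic(p, q, 0.5)  # midpoint along the e-geodesic
print(mid_m)
print(mid_e)
```

The two midpoints generally differ: the m-midpoint is the arithmetic mean of `p` and `q`, while the e-midpoint is their normalized geometric mean, so log-odds between outcomes are averaged rather than probabilities.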
Question 2 True / False
The generalized Pythagorean theorem states: if S is an e-flat submanifold (such as an exponential family) and q is the m-projection of p onto S, that is, the point of S minimizing D_KL(p||q), then D_KL(p||r) = D_KL(p||q) + D_KL(q||r) for any r in S. This 'right angle' property relies on pairing the projection with a submanifold that is flat in the dual connection.
T. True
F. False
Answer: True
The Pythagorean theorem requires that the submanifold S and the projection be compatible: the flatness of S and the geodesic used for the projection must belong to dual connections. For an m-projection onto an e-flat submanifold, the m-geodesic from p to q meets S orthogonally with respect to the Fisher metric, and the e-geodesic from q to any other point r stays inside S, which creates the 'right angle' condition. The dual statement pairs an e-projection with an m-flat submanifold (such as a mixture family), with the arguments of the KL divergence reversed. This decomposition means the KL from p to any point r in S separates into two additive components: the 'error' (p to q) and the 'within-manifold distance' (q to r). This property is fundamental to optimization: once p has been projected onto S, minimizing D_KL(p||r) over r in S reduces to minimizing D_KL(q||r), so the projection q is the unique global minimizer. The EM algorithm alternates e-projections and m-projections, and neither projection can increase the objective, which is why the objective changes monotonically.
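The decomposition can be checked numerically with a minimal sketch. The assumptions here are illustrative, not from the text: S is taken to be the family of independent (product) distributions over two binary variables, which is an exponential family, and q is the minimizer of D_KL(p||q') over q' in S, which for this family is the product of the marginals of p. The helper names `kl` and `product` and all numbers are hypothetical.

```python
import math

def kl(a, b):
    """KL divergence D(a||b) for distributions given as flat lists."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def product(row, col):
    """Joint distribution of two independent binary variables, flattened
    as [P(0,0), P(0,1), P(1,0), P(1,1)]."""
    return [r * c for r in row for c in col]

# A correlated joint distribution p over (x, y), flattened the same way.
p = [0.4, 0.1, 0.2, 0.3]
px = [p[0] + p[1], p[2] + p[3]]  # marginal of x
py = [p[0] + p[2], p[1] + p[3]]  # marginal of y

q = product(px, py)                  # minimizer of D(p||q') over product q'
r = product([0.3, 0.7], [0.6, 0.4])  # an arbitrary other point of the family

lhs = kl(p, r)
rhs = kl(p, q) + kl(q, r)
print(lhs, rhs)  # the two values agree up to floating-point rounding
```

Replacing `r` by any other product distribution leaves the equality intact, whereas replacing `q` by a point that is not the product of the marginals breaks it.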
Question 3 Short Answer
Explain how natural gradient descent differs from Euclidean gradient descent in parameter space, and why it converges faster on statistical manifolds.
Model answer: Euclidean gradient descent updates parameters as theta_{t+1} = theta_t - eta * dL/dtheta, where L is a loss function. This is the steepest-descent direction under the Euclidean metric on parameter space, which is coordinate-dependent: the trajectory and the convergence rate depend on how you parameterize the problem. Natural gradient descent instead uses the Fisher information matrix F as the metric: theta_{t+1} = theta_t - eta * F^(-1) * dL/dtheta. The inverse Fisher information F^(-1) rescales gradients by the information content, so updates in directions of high information (where a small parameter change alters the distribution a lot) are dampened and updates in directions of low information are amplified. Geometrically, the natural gradient is the steepest-descent direction on the statistical manifold equipped with the Fisher metric, and in the limit of small steps its trajectory does not depend on the parameterization. It is often faster because F^(-1) acts as a preconditioner that corrects for the poor conditioning a given parameterization may introduce. For maximum-likelihood fitting of an exponential family in its natural parameters, F equals the Hessian of the negative log-likelihood, so natural gradient with eta = 1 coincides with Newton's method and converges quadratically near the optimum. When L is the KL divergence D_KL(p||p_theta) between true and model distributions (up to a constant), that divergence decreases monotonically along the natural gradient trajectory for sufficiently small step sizes.
Natural gradient descent is coordinate-invariant in the small-step limit (changing the parameterization doesn't change the path traced through the space of distributions), adaptive (it adjusts the step based on local information content), and geometrically principled. Neural networks trained with natural gradient (or approximations like K-FAC) often converge in fewer iterations than with Euclidean gradient descent, at a higher cost per iteration; the benefit is often most noticeable early in training.
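The difference between the two update rules can be seen in a minimal one-parameter sketch. The setup is an assumption made for illustration: a Bernoulli model with logit parameter theta, fitted by minimizing the average negative log-likelihood given a sample mean `xbar`. In this parameterization the Fisher information is sigmoid(theta) * (1 - sigmoid(theta)), which also equals the second derivative of the loss, so the natural gradient step with eta = 1 is a Newton step. All function names are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def grad(theta, xbar):
    """Derivative of the average negative log-likelihood of a Bernoulli
    model with logit parameter theta, given the sample mean xbar."""
    return sigmoid(theta) - xbar

def fisher(theta):
    """Fisher information of the Bernoulli model in the logit parameter."""
    s = sigmoid(theta)
    return s * (1.0 - s)

def run(steps, eta, xbar, natural):
    """Run gradient descent from theta = 0, optionally preconditioned
    by the inverse Fisher information."""
    theta = 0.0
    for _ in range(steps):
        g = grad(theta, xbar)
        if natural:
            g = g / fisher(theta)  # multiply the gradient by F^(-1)
        theta -= eta * g
    return theta

xbar = 0.9                            # illustrative sample mean
target = math.log(xbar / (1 - xbar))  # maximum-likelihood logit
theta_euclid = run(5, 1.0, xbar, natural=False)
theta_natural = run(5, 1.0, xbar, natural=True)
print(target, theta_euclid, theta_natural)
```

With the same step size and number of steps, the preconditioned iterate ends up much closer to the maximum-likelihood logit than the plain gradient iterate. This is a single toy problem, not a general benchmark, and Newton-like steps can overshoot for more extreme values of `xbar`.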
Question 4 Multiple Choice
In the information-geometric view of the EM algorithm (Amari's em algorithm), the E-step performs an e-projection (finding the posterior distribution over latent variables) and the M-step performs an m-projection (finding the maximum-likelihood parameters). Why does this alternation guarantee monotonic improvement of the log-likelihood?
A. Because both projections decrease the KL divergence
B. Because e-projections and m-projections are orthogonal in the information-geometric sense (with respect to dual connections), and each step cannot increase the divergence between the data manifold and the model manifold in the dually flat space
C. Because the E-step and M-step are inverses of each other
D. Convergence is not guaranteed; the EM algorithm can diverge
In information geometry terms, the EM algorithm alternates between two projections in a dually flat space. The E-step (latent variable posterior) is an e-projection of the current model onto the data manifold, the set of complete-data distributions consistent with the observations, which is m-flat. The M-step (parameter update) is an m-projection of that point back onto the model manifold, which is e-flat when the complete-data model is an exponential family. Each projection is paired with a submanifold that is flat in the dual connection, so the Pythagorean theorem applies and neither step can increase the KL divergence between the current point on the data manifold and the current model. The log-likelihood improvement is a consequence of this geometric monotonicity. This explains why EM is so reliable: the geometric structure guarantees monotone improvement without requiring convexity, line searches, or other heuristics, although the limit may be a local rather than a global optimum.
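The monotonicity can be observed on a toy problem. The following is a minimal sketch under stated assumptions, not a general implementation: a two-component Gaussian mixture in one dimension with unit variances and fixed equal weights, where only the two means are estimated. The data values and the function names `log_likelihood` and `em_step` are invented for illustration.

```python
import math

def log_likelihood(data, mu, weight=0.5):
    """Log-likelihood of a two-component, unit-variance Gaussian mixture
    with fixed mixing weights (weight, 1 - weight) and means mu."""
    total = 0.0
    for x in data:
        dens = sum(w * math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2 * math.pi)
                   for w, m in zip((weight, 1 - weight), mu))
        total += math.log(dens)
    return total

def em_step(data, mu, weight=0.5):
    """One EM iteration that updates only the two component means."""
    # E-step: posterior responsibility of component 0 for each point.
    resp = []
    for x in data:
        a = weight * math.exp(-0.5 * (x - mu[0]) ** 2)
        b = (1 - weight) * math.exp(-0.5 * (x - mu[1]) ** 2)
        resp.append(a / (a + b))
    # M-step: responsibility-weighted means maximize the expected
    # complete-data log-likelihood.
    n0 = sum(resp)
    n1 = len(data) - n0
    mu0 = sum(r * x for r, x in zip(resp, data)) / n0
    mu1 = sum((1 - r) * x for r, x in zip(resp, data)) / n1
    return (mu0, mu1)

data = [-2.1, -1.9, -2.4, -1.6, 1.8, 2.2, 2.5, 1.5]  # toy data
mu = (-0.5, 0.5)                                      # initial guess
history = [log_likelihood(data, mu)]
for _ in range(20):
    mu = em_step(data, mu)
    history.append(log_likelihood(data, mu))

print(mu)
print(history[0], history[-1])
```

Each entry of `history` is at least as large as the previous one, up to floating-point rounding. The sketch shows monotone improvement only; with a different initial guess or data set, the iteration may settle at a local optimum.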