Expectation-Maximization Algorithm

Graduate Depth 72 in the knowledge graph I know this Set as goal
Unlocks 1 downstream topic
em expectation-maximization latent

Core Idea

The EM algorithm iteratively estimates parameters of models with latent (unobserved) variables. The E-step computes expected latent values given current parameters; the M-step optimizes parameters given expected latents. EM guarantees monotonic likelihood improvement and is widely used for clustering, mixture models, and HMM training.

Explainer

Many probabilistic models contain variables that we cannot observe. In a Gaussian mixture model, each data point was generated by one of K Gaussian components — but we do not know which one. In a hidden Markov model, each observation comes from a hidden state — but we cannot see the states. These unobserved quantities are called latent variables. If we could observe the latent variables, parameter estimation would be straightforward: just compute the maximum likelihood estimates for each component separately. Without them, the likelihood function becomes a sum over all possible latent assignments, and this sum makes direct optimization intractable.

EM sidesteps this by alternating between two steps. The E-step (Expectation) asks: given my current best guess for the parameters, what is the most probable explanation for each data point in terms of the latent variables? For a Gaussian mixture, this means computing the posterior probability that each data point belongs to each component — a soft, probabilistic assignment called the "responsibility." For a hidden Markov model, it means computing the probability of being in each hidden state at each time step using the forward-backward algorithm. Crucially, these are not hard assignments; every data point is fractionally assigned to every component according to the posterior.

The M-step (Maximization) asks: given these soft assignments, what parameter values best explain the data? Because the latent variables are now treated as known (in expectation), this becomes a weighted maximum likelihood problem that often has a closed-form solution. For a Gaussian mixture, the new mean of component k is just the weighted average of all data points, where the weights are the responsibilities computed in the E-step. Then you go back to the E-step with the new parameters and repeat.

The reason this works — and converges — is grounded in the mathematics of Jensen's inequality. The E-step constructs a lower bound on the log-likelihood (called the ELBO, or Evidence Lower Bound) that is tight at the current parameters. The M-step maximizes this lower bound. Because the bound was tight before the M-step, the log-likelihood at the new parameters is at least as high as at the old parameters. This guarantees monotonic non-decrease of the log-likelihood at every iteration — EM never makes things worse. However, it only guarantees ascent to a local maximum, not the global one. EM is sensitive to initialization, and running the algorithm multiple times from different starting points is standard practice.

EM is widely used wherever latent variables arise naturally: training Gaussian mixture models (soft clustering), fitting hidden Markov models (speech recognition, sequence labeling), computing the parameters of factor analysis, and handling missing data in statistics. Its appeal is that each step has a clean probabilistic interpretation and the M-step is often analytically solvable, making implementation straightforward even when the E-step requires dynamic programming or other inference algorithms internally.

Practice Questions 3 questions

Prerequisite Chain

Counting to 10Counting to 20Understanding ZeroThe Number ZeroCounting to FiveOne-to-One CorrespondenceCombining Small Groups Within 5Addition Within 10Addition Within 20Two-Digit Addition Without RegroupingTwo-Digit Addition with RegroupingAddition Within 100Repeated Addition as MultiplicationMultiplication Facts Within 100Division as Equal SharingDivision as Grouping (Measurement Division)Division: Grouping (Repeated Subtraction) ModelDivision: Fair Sharing ModelDivision as Equal SharingDivision as GroupingBasic Division FactsDivision Facts Within 100Two-Digit by One-Digit DivisionDivision with RemaindersRemainders and Quotients in DivisionDivision Word ProblemsIntroduction to Long DivisionFactors and MultiplesPrime and Composite NumbersEquivalent FractionsRelating Fractions and DecimalsDecimal Place ValueReading and Writing DecimalsComparing and Ordering DecimalsAdding and Subtracting DecimalsMultiplying DecimalsDividing DecimalsDividing FractionsMixed Number ArithmeticOrder of OperationsInteger Order of OperationsVariable ExpressionsCombining Like TermsOne-Step EquationsTwo-Step EquationsSolving Multi-Step EquationsEquations with Variables on Both SidesAngle Pairs: Complementary, Supplementary, and VerticalParallel Lines and TransversalsCorresponding AnglesAlternate Interior AnglesTriangle Angle Sum TheoremExterior Angle TheoremTriangle Inequality TheoremSimilar Triangles: AA SimilaritySimilar Triangles: SSS and SAS SimilarityProportions in Similar TrianglesRight Triangle Trigonometry IntroductionTrigonometric Ratios ReviewRadian MeasureConverting Between Degrees and RadiansThe Unit CircleGraphing Sine and CosineGraphing Tangent and Reciprocal Trigonometric FunctionsDerivatives of Trigonometric FunctionsAntiderivativesIterated Integrals and Fubini's TheoremJoint Distributions and Marginals (Rigorous)Independence of Sigma-AlgebrasConditional ExpectationMarkov ChainsHidden Markov ModelsExpectation-Maximization Algorithm

Longest path: 73 steps · 434 total prerequisite topics

Prerequisites (9)

Leads To (1)