A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Expectation-Maximization Algorithm

Graduate · Depth 97 in the knowledge graph

1 topic builds on this

628 prerequisites beneath it

em expectation-maximization latent

Core Idea

The EM algorithm iteratively estimates the parameters of models with latent (unobserved) variables. The E-step computes the posterior distribution over the latent variables given the current parameters; the M-step re-estimates the parameters to maximize the expected complete-data log-likelihood under that posterior. Each iteration is guaranteed not to decrease the likelihood, and EM is widely used for clustering, mixture models, and HMM training.

Explainer

Many probabilistic models contain variables that we cannot observe. In a Gaussian mixture model, each data point was generated by one of K Gaussian components, but we do not know which one. In a hidden Markov model, each observation comes from a hidden state, but we cannot see the states. These unobserved quantities are called latent variables. If we could observe the latent variables, parameter estimation would be straightforward: just compute the maximum likelihood estimates for each component separately. Without them, the likelihood of each data point becomes a sum over all possible latent assignments. The logarithm of that sum does not split into per-component terms, so the maximum likelihood estimates generally have no closed form and direct optimization is awkward.
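To make the difficulty concrete, the two log-likelihoods for a K-component Gaussian mixture can be written side by side. The notation is an assumption introduced here, not taken from the text above: mixing weights \pi_k, means \mu_k, covariances \Sigma_k, all parameters collected in \theta, and a latent label z_i for each data point x_i.

```latex
% Observed-data log-likelihood: the log sits outside the sum over components,
% so the parameters of different components are coupled.
\log p(X \mid \theta)
  = \sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)

% Complete-data log-likelihood (labels z_i observed): the indicator selects one
% component per point, and the expression separates by component.
\log p(X, Z \mid \theta)
  = \sum_{i=1}^{N} \sum_{k=1}^{K} \mathbb{1}[z_i = k]
    \left( \log \pi_k + \log \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \right)
```

In the second expression each component's parameters appear only in the terms for its own data points, which is why estimation is easy when the labels are known.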

EM sidesteps this by alternating between two steps. The E-step (Expectation) asks: given my current best guess for the parameters, what is the posterior distribution over the latent variables for each data point? For a Gaussian mixture, this means computing the posterior probability that each data point belongs to each component, a soft, probabilistic assignment called the "responsibility." For a hidden Markov model, it means computing the posterior probability of being in each hidden state at each time step using the forward-backward algorithm. Crucially, these are not hard assignments: every data point is fractionally assigned to every component according to the posterior.
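A minimal sketch of the E-step for a one-dimensional Gaussian mixture, using only the Python standard library. The function names `gaussian_pdf` and `e_step`, the toy data, and the parameter values are illustrative assumptions, not part of the original text.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def e_step(data, weights, means, variances):
    """Return responsibilities r[i][k] = P(component k | x_i, current parameters)."""
    responsibilities = []
    for x in data:
        # Joint weight of (x, component k) under the current parameters.
        joint = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)  # marginal density of x under the mixture
        # Bayes' rule: normalize so each row sums to 1.
        responsibilities.append([j / total for j in joint])
    return responsibilities

# Hypothetical toy data and a hypothetical current parameter guess.
data = [-2.1, -1.9, 0.0, 2.0, 2.2]
resp = e_step(data, weights=[0.5, 0.5], means=[-2.0, 2.0], variances=[1.0, 1.0])
for x, row in zip(data, resp):
    print(x, [round(r, 4) for r in row])
```

The point at 0.0 sits exactly between the two symmetric components, so it is split evenly, while points near a component mean are assigned almost entirely to that component. This sketch normalizes raw densities directly; for data far from every component the densities can underflow, and working in log space is the usual remedy.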

The M-step (Maximization) asks: given these soft assignments, what parameter values best explain the data? Because the latent variables are now treated as known (in expectation), this becomes a weighted maximum likelihood problem that often has a closed-form solution. For a Gaussian mixture, the new mean of component k is just the weighted average of all data points, where the weights are the responsibilities computed in the E-step. Then you go back to the E-step with the new parameters and repeat.
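The weighted updates described above can be sketched for a one-dimensional Gaussian mixture as follows. The function name `m_step` and the toy inputs are assumptions made for illustration; besides the mean, the sketch also updates the mixing weights and variances, which follow the same weighted-average pattern.

```python
def m_step(data, responsibilities):
    """Weighted maximum-likelihood updates for a 1-D Gaussian mixture."""
    n = len(data)
    k = len(responsibilities[0])
    weights, means, variances = [], [], []
    for j in range(k):
        r_j = [row[j] for row in responsibilities]
        n_j = sum(r_j)  # effective number of points assigned to component j
        # Responsibility-weighted average of the data.
        mean_j = sum(r * x for r, x in zip(r_j, data)) / n_j
        # Responsibility-weighted average squared deviation from the new mean.
        var_j = sum(r * (x - mean_j) ** 2 for r, x in zip(r_j, data)) / n_j
        weights.append(n_j / n)
        means.append(mean_j)
        variances.append(var_j)
    return weights, means, variances

# Hypothetical input: with 0/1 responsibilities the update reduces to the
# ordinary per-component maximum likelihood estimates.
data = [0.0, 2.0, 10.0, 12.0]
resp = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
weights, means, variances = m_step(data, resp)
print(weights, means, variances)
```

The sketch has no safeguard against a component whose responsibilities sum to zero or whose variance collapses toward zero; a practical implementation would need to handle both.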

The reason this works, and converges, is grounded in Jensen's inequality. The E-step constructs a lower bound on the log-likelihood (called the ELBO, or Evidence Lower Bound) that is tight at the current parameters, meaning it equals the log-likelihood there. The M-step maximizes this lower bound over the parameters. Because the bound was tight before the M-step, the M-step cannot lower it, and the log-likelihood is never below the bound, the log-likelihood at the new parameters is at least as high as at the old parameters. This guarantees that the log-likelihood never decreases from one iteration to the next: EM never makes things worse. However, under standard regularity conditions EM is only guaranteed to converge to a stationary point of the likelihood, typically a local maximum rather than the global one. EM is sensitive to initialization, and running the algorithm multiple times from different starting points is standard practice.
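The argument can be written out in a few lines. The symbols are assumptions introduced for this sketch: X is the observed data, Z the latent variables, \theta_t the parameters at iteration t, q any distribution over Z, and \mathcal{L}(q, \theta) the lower bound.

```latex
% Jensen's inequality gives a lower bound for any distribution q(Z):
\log p(X \mid \theta)
  = \log \sum_{Z} q(Z) \, \frac{p(X, Z \mid \theta)}{q(Z)}
  \;\ge\; \sum_{Z} q(Z) \log \frac{p(X, Z \mid \theta)}{q(Z)}
  \;=:\; \mathcal{L}(q, \theta)

% The gap is a KL divergence, which is zero exactly when q is the posterior:
\log p(X \mid \theta) - \mathcal{L}(q, \theta)
  = \mathrm{KL}\!\left( q(Z) \,\|\, p(Z \mid X, \theta) \right) \;\ge\; 0

% E-step: choose q_t(Z) = p(Z \mid X, \theta_t), so the bound is tight at \theta_t.
% M-step: \theta_{t+1} = \arg\max_{\theta} \mathcal{L}(q_t, \theta).
% Chaining the two steps:
\log p(X \mid \theta_{t+1})
  \;\ge\; \mathcal{L}(q_t, \theta_{t+1})
  \;\ge\; \mathcal{L}(q_t, \theta_t)
  \;=\; \log p(X \mid \theta_t)
```

The first inequality in the chain is the bound itself, the second holds because the M-step maximizes over \theta, and the final equality is the tightness established by the E-step.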

EM is widely used wherever latent variables arise naturally: training Gaussian mixture models (soft clustering), fitting hidden Markov models (speech recognition, sequence labeling), estimating the parameters of factor analysis models, and handling missing data in statistics. Its appeal is that each step has a clean probabilistic interpretation and the M-step is often analytically solvable, which keeps implementation straightforward even when the E-step requires dynamic programming or other inference algorithms internally.
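As a rough illustration of how compact the mixture case can be, here is a sketch of a full EM loop for a one-dimensional Gaussian mixture that records the log-likelihood at every iteration and is restarted from several initial means. The names `em_fit` and `log_likelihood`, the toy data, the starting points, and the fixed iteration count are all assumptions chosen for the example; it uses only the standard library and includes no safeguards against empty components or collapsing variances.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(data, weights, means, variances):
    """Observed-data log-likelihood of the mixture."""
    return sum(
        math.log(sum(w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data
    )

def em_fit(data, init_means, n_iter=30):
    """Run a fixed number of EM iterations; return parameters and the log-likelihood trace."""
    k = len(init_means)
    weights, means, variances = [1.0 / k] * k, list(init_means), [1.0] * k
    history = [log_likelihood(data, weights, means, variances)]
    for _ in range(n_iter):
        # E-step: responsibilities under the current parameters.
        resp = []
        for x in data:
            joint = [w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: responsibility-weighted maximum likelihood updates.
        counts = [sum(row[j] for row in resp) for j in range(k)]
        weights = [c / len(data) for c in counts]
        means = [sum(row[j] * x for row, x in zip(resp, data)) / counts[j]
                 for j in range(k)]
        variances = [sum(row[j] * (x - means[j]) ** 2
                         for row, x in zip(resp, data)) / counts[j]
                     for j in range(k)]
        history.append(log_likelihood(data, weights, means, variances))
    return weights, means, variances, history

# Hypothetical data: two well-separated clusters near -3 and +3.
data = [-3.2, -2.9, -3.1, -2.7, -3.4, -2.5, -3.0,
        2.6, 3.3, 2.9, 3.1, 3.5, 2.8, 3.0]

# Restart from several hypothetical starting points and keep the best run.
inits = [[-1.0, 1.0], [0.0, 0.5], [5.0, 6.0]]
runs = [em_fit(data, init) for init in inits]
best = max(runs, key=lambda run: run[3][-1])
print("best means:", sorted(best[1]))
print("final log-likelihood per run:", [run[3][-1] for run in runs])
```

Each run's `history` should be non-decreasing up to floating-point error, which is a useful sanity check when debugging an EM implementation: a drop in the log-likelihood usually signals a bug in one of the two steps.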

Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Dividing Integers → Unit Rates → Proportions → Percent Concept → Converting Between Fractions, Decimals, and Percents → Operations with Rational Numbers → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → 
Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Sine, Cosine, and Tangent Ratios → Trigonometric Ratios Review → Radian Measure → Converting Between Degrees and Radians → The Unit Circle → Graphing Sine and Cosine → Graphing Tangent and Reciprocal Trigonometric Functions → Derivatives of Trigonometric Functions → Antiderivatives → Indefinite Integrals → Basic Integration Rules → Riemann Sums → Definite Integral Definition → Probability Density Functions and Continuous Distributions → Cumulative Distribution Functions → Continuous Random Variables → Probability Density Functions → Expected Value → Weak Law of Large Numbers → Probability Axioms and Rules → Conditional Probability → Law of Total Probability → Bayes' Theorem and Statistical Inference → Bayesian Networks and Inference → Probabilistic Graphical Models → Expectation-Maximization Algorithm

Longest path: 98 steps · 628 total prerequisite topics

Prerequisites (9)

Probabilistic Graphical Models (hard)
Hidden Markov Models (soft)
Expected Value: Theory and Properties (soft)
Conditional Probability (soft)
Derivatives of Exponential Functions (soft)
Probability Axioms (soft)
Conditional Expectation (soft)
Optimization Problems (soft)
Expected Value and Variance (soft)

Leads To (1)

Mixture Models and Gaussian Mixture Models (hard)