The EM algorithm iteratively estimates parameters of models with latent (unobserved) variables. The E-step computes expected latent values given current parameters; the M-step optimizes parameters given expected latents. EM guarantees monotonic likelihood improvement and is widely used for clustering, mixture models, and HMM training.
Many probabilistic models contain variables that we cannot observe. In a Gaussian mixture model, each data point was generated by one of K Gaussian components — but we do not know which one. In a hidden Markov model, each observation comes from a hidden state — but we cannot see the states. These unobserved quantities are called latent variables. If we could observe the latent variables, parameter estimation would be straightforward: just compute the maximum likelihood estimates for each component separately. Without them, the likelihood function becomes a sum over all possible latent assignments, and this sum makes direct optimization intractable.
EM sidesteps this by alternating between two steps. The E-step (Expectation) asks: given my current best guess for the parameters, what is the most probable explanation for each data point in terms of the latent variables? For a Gaussian mixture, this means computing the posterior probability that each data point belongs to each component — a soft, probabilistic assignment called the "responsibility." For a hidden Markov model, it means computing the probability of being in each hidden state at each time step using the forward-backward algorithm. Crucially, these are not hard assignments; every data point is fractionally assigned to every component according to the posterior.
The M-step (Maximization) asks: given these soft assignments, what parameter values best explain the data? Because the latent variables are now treated as known (in expectation), this becomes a weighted maximum likelihood problem that often has a closed-form solution. For a Gaussian mixture, the new mean of component k is just the weighted average of all data points, where the weights are the responsibilities computed in the E-step. Then you go back to the E-step with the new parameters and repeat.
The reason this works — and converges — is grounded in the mathematics of Jensen's inequality. The E-step constructs a lower bound on the log-likelihood (called the ELBO, or Evidence Lower Bound) that is tight at the current parameters. The M-step maximizes this lower bound. Because the bound was tight before the M-step, the log-likelihood at the new parameters is at least as high as at the old parameters. This guarantees monotonic non-decrease of the log-likelihood at every iteration — EM never makes things worse. However, it only guarantees ascent to a local maximum, not the global one. EM is sensitive to initialization, and running the algorithm multiple times from different starting points is standard practice.
EM is widely used wherever latent variables arise naturally: training Gaussian mixture models (soft clustering), fitting hidden Markov models (speech recognition, sequence labeling), computing the parameters of factor analysis, and handling missing data in statistics. Its appeal is that each step has a clean probabilistic interpretation and the M-step is often analytically solvable, making implementation straightforward even when the E-step requires dynamic programming or other inference algorithms internally.