A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Maximum Likelihood Estimation

Graduate · Depth 109 in the knowledge graph

35 topics build on this

640 prerequisites beneath it

Prerequisites: Classical OLS Assumptions (Gauss-Markov), Fundamental Theorem of Calculus Part 1, +9 more
Builds toward: Count Data Models: Poisson and Negative Binomial Regression, Generalized Method of Moments (GMM), +7 more

Core Idea

Maximum likelihood estimation (MLE) finds the parameter values that make the observed data most probable under a specified distributional model. For independent observations, the log-likelihood function ℓ(θ) = Σᵢ log f(yᵢ; θ) is maximized with respect to θ, which typically requires numerical optimization when no closed-form solution exists. Under correct model specification and standard regularity conditions, ML estimators are consistent and asymptotically efficient (their asymptotic variance attains the Cramér-Rao lower bound). Under normally distributed errors, OLS and MLE give the same coefficient estimates in linear regression. When the distributional form is wrong, MLE can be inconsistent; quasi-MLE is a robust alternative that still provides consistent estimates for certain parameters, such as those of the conditional mean, when that part of the model is correctly specified.

How It's Best Learned

Derive the MLE for the mean of a normal distribution by hand. This makes the logic of maximizing the likelihood concrete before you apply it to more complex models like logit.
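
As a sketch of that hand derivation, assume the observations y₁, …, yₙ are independent draws from N(μ, σ²); the symbols μ̂ and ȳ are notation introduced here for the estimate and the sample mean. Setting the derivative of the log-likelihood with respect to μ to zero gives the sample mean, and the negative second derivative confirms that it is a maximum:

```latex
\begin{align*}
\ell(\mu, \sigma^2) &= \sum_{i=1}^{n} \log f(y_i; \mu, \sigma^2)
  = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2 \\
\frac{\partial \ell}{\partial \mu} &= \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \mu) = 0
  \quad\Longrightarrow\quad \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y} \\
\frac{\partial^2 \ell}{\partial \mu^2} &= -\frac{n}{\sigma^2} < 0
\end{align*}
```

Note that σ² drops out of the first-order condition, so the result holds whether the variance is known or estimated alongside μ.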

Common Misconceptions

MLE requires a correctly specified distributional assumption; when in doubt, OLS with robust standard errors is safer for linear models.
The MLE is not always unbiased: in small samples it can be biased (e.g., the MLE for the normal variance divides by n, not n−1).

Explainer

Maximum likelihood estimation asks a deceptively simple question: given the data I observed, what parameter values would have made this data most likely to occur? If you flip a coin 10 times and get 7 heads, the MLE for the probability of heads is 0.7 — the value that assigns the highest probability to the outcome "7 heads in 10 flips." The same logic extends to any parametric model: specify a distribution for the data, write down the probability of observing your sample as a function of the parameters, and then find the parameters that maximize it.
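
The coin example can be checked numerically. The following is a minimal sketch, not a general-purpose routine: it evaluates the binomial log-likelihood on a grid of candidate probabilities and picks the best one. The function name `binomial_log_likelihood` and the grid resolution are illustrative choices, not part of the original text.

```python
import math

def binomial_log_likelihood(p, heads, flips):
    """Log-probability of observing `heads` successes in `flips` Bernoulli(p) trials."""
    log_binom = math.log(math.comb(flips, heads))
    return log_binom + heads * math.log(p) + (flips - heads) * math.log(1 - p)

heads, flips = 7, 10

# Candidate values of p on the open interval (0, 1); the grid step is an assumption.
grid = [i / 1000 for i in range(1, 1000)]

# Grid search: the candidate with the highest log-likelihood.
p_hat_grid = max(grid, key=lambda p: binomial_log_likelihood(p, heads, flips))

# Closed-form MLE for a Bernoulli probability: the sample proportion.
p_hat_closed_form = heads / flips

print(p_hat_grid, p_hat_closed_form)
```

The grid search and the closed-form sample proportion should agree up to the grid resolution, which is the point of the example: the MLE is simply the value of p under which the observed outcome is most probable.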

In practice, we work with the log-likelihood rather than the likelihood itself. Because observations are assumed independent, the likelihood is a product of n terms. For discrete data each term is a probability between 0 and 1, and although continuous densities can exceed 1, the product still tends to become vanishingly small for large n and is prone to numerical underflow. Taking the log converts the product to a sum, ℓ(θ) = Σᵢ log f(yᵢ; θ), which is much easier to work with analytically and numerically. Since log is a strictly increasing function, the θ that maximizes ℓ(θ) also maximizes L(θ), so nothing is lost.
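
A small experiment makes the underflow problem visible. This sketch assumes a simulated standard normal sample of 2,000 points (the sample size, seed, and helper name `normal_log_pdf` are illustrative) and compares the raw likelihood product with the log-likelihood sum.

```python
import math
import random

def normal_log_pdf(y, mu, sigma):
    """Log-density of N(mu, sigma^2) evaluated at y."""
    z = (y - mu) / sigma
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - 0.5 * z ** 2

random.seed(0)  # fixed seed so the run is reproducible
sample = [random.gauss(0.0, 1.0) for _ in range(2000)]

# Raw likelihood: a product of 2,000 density values, each below 0.4 here.
likelihood = 1.0
for y in sample:
    likelihood *= math.exp(normal_log_pdf(y, 0.0, 1.0))

# Log-likelihood: the same information expressed as a sum.
log_likelihood = sum(normal_log_pdf(y, 0.0, 1.0) for y in sample)

print("likelihood:", likelihood)          # underflows in double precision
print("log-likelihood:", log_likelihood)  # stays finite
```

The product collapses to zero in floating point, so it can no longer distinguish between parameter values, while the sum remains a usable objective for an optimizer.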

One result you should know cold: for the normal linear regression model, MLE and OLS give identical coefficient estimates. Plugging the normal density into the log-likelihood and maximizing with respect to β reduces algebraically to minimizing the sum of squared residuals, the same criterion OLS uses. (The variance estimates differ slightly: the MLE of σ² divides the sum of squared residuals by n, with no degrees-of-freedom correction.) This equivalence shows that OLS can be interpreted as the MLE under normal errors, even though OLS is typically derived without any distributional assumption and does not need normality for consistency. In non-linear models like logit or Poisson regression, where OLS does not directly apply, MLE becomes the standard estimation approach.
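
The equivalence can be illustrated on simulated data. The sketch below assumes a one-regressor model y = a + b·x + u with normal errors; the true coefficients, sample size, step size, and iteration count are all illustrative choices. It computes OLS in closed form and then maximizes the normal log-likelihood over (a, b) by plain gradient ascent, a deliberately simple stand-in for the optimizers used in practice.

```python
import random

random.seed(1)
n = 200
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [1.0 + 2.0 * xi + random.gauss(0.0, 0.5) for xi in x]

# OLS in closed form for a single regressor with intercept.
x_bar = sum(x) / n
y_bar = sum(y) / n
b_ols = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
a_ols = y_bar - b_ols * x_bar

# MLE by gradient ascent on the average normal log-likelihood.
# sigma^2 is held at 1: it only rescales the gradient in (a, b),
# so the maximizing coefficients are the same for any positive sigma^2.
a_mle, b_mle = 0.0, 0.0
step = 0.5
for _ in range(500):
    residuals = [yi - a_mle - b_mle * xi for xi, yi in zip(x, y)]
    grad_a = sum(residuals) / n
    grad_b = sum(r * xi for r, xi in zip(residuals, x)) / n
    a_mle += step * grad_a
    b_mle += step * grad_b

print("OLS:", a_ols, b_ols)
print("MLE:", a_mle, b_mle)
```

The two pairs of estimates should agree to many decimal places, because the gradient of the normal log-likelihood in the coefficients is proportional to the OLS normal equations.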

ML estimators have attractive large-sample (asymptotic) properties: under standard regularity conditions they are consistent (converge in probability to the true parameter as n → ∞), asymptotically normal, and asymptotically efficient, meaning their asymptotic variance attains the Cramér-Rao lower bound, the smallest variance any unbiased estimator can have. These properties, however, all depend on the model being correctly specified. If the assumed distribution does not match the true data-generating process, the estimator may converge to the wrong value entirely (inconsistency). This is the sharpest difference between MLE and OLS for linear regression: OLS only needs E[u|x] = 0 for consistency, while MLE generally needs the full distributional form to be right.
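
A simulation can give a feel for these large-sample claims in a correctly specified model. The sketch below uses an exponential distribution as an illustrative choice (it is not discussed in the text): the MLE of the rate is 1 divided by the sample mean, and the Cramér-Rao bound for the rate is rate² / n. The true rate, sample size, and number of replications are assumptions made for the example.

```python
import random
import statistics

random.seed(42)
true_rate = 2.0
n = 400              # observations per simulated sample
replications = 2000  # number of simulated samples

estimates = []
for _ in range(replications):
    sample = [random.expovariate(true_rate) for _ in range(n)]
    estimates.append(n / sum(sample))  # MLE of the rate: 1 / sample mean

mean_estimate = statistics.fmean(estimates)
empirical_variance = statistics.pvariance(estimates)

# Cramér-Rao bound: 1 / (n * I(rate)), with Fisher information I(rate) = 1 / rate**2.
cramer_rao_bound = true_rate ** 2 / n
variance_ratio = empirical_variance / cramer_rao_bound

print("mean of estimates:", mean_estimate)
print("empirical variance / Cramér-Rao bound:", variance_ratio)
```

With a sample size this large, the average estimate should sit close to the true rate and the variance ratio should be close to 1, which is what consistency and asymptotic efficiency predict. Rerunning with a much smaller n would show both a visible bias and a ratio further from 1.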

In small samples, MLE can be biased even when the model is correctly specified. The classic example is the variance of a normal distribution: the MLE divides by n rather than n−1, so its expected value is (n−1)/n · σ², a slightly downward-biased estimate. This finite-sample bias typically shrinks as n grows, but it is a reminder that the asymptotic efficiency of MLE does not mean it is always the best choice in small data settings.
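
The variance example is easy to reproduce by simulation. This sketch assumes normal samples of size 5 with true variance 4 (both illustrative choices, as is the helper name `variance_estimates`) and averages the two estimators over many replications.

```python
import random

def variance_estimates(sample):
    """Return (MLE variance, unbiased variance) for one sample."""
    n = len(sample)
    mean = sum(sample) / n
    sum_sq = sum((y - mean) ** 2 for y in sample)
    return sum_sq / n, sum_sq / (n - 1)

random.seed(7)
true_variance = 4.0
n = 5
replications = 20000

mle_total = 0.0
unbiased_total = 0.0
for _ in range(replications):
    sample = [random.gauss(0.0, true_variance ** 0.5) for _ in range(n)]
    mle, unbiased = variance_estimates(sample)
    mle_total += mle
    unbiased_total += unbiased

mean_mle = mle_total / replications
mean_unbiased = unbiased_total / replications

print("average MLE variance:", mean_mle)            # theory: (n - 1) / n * 4 = 3.2
print("average unbiased variance:", mean_unbiased)  # theory: 4.0
```

The averages should land near the theoretical values in the comments, with the MLE falling short by the factor (n−1)/n. Increasing n in the sketch shrinks the gap between the two.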
