Maximum likelihood estimation (MLE) finds the parameter values that make the observed data most probable under a specified distributional model. The log-likelihood function ℓ(θ) = Σᵢ log f(yᵢ; θ) is maximized with respect to θ, which typically requires numerical optimization. Under correct model specification, maximum likelihood estimators are consistent and asymptotically efficient (their asymptotic variance attains the Cramér-Rao lower bound). Under normally distributed errors, OLS and MLE give the same coefficient estimates in linear regression. When the distributional form is wrong, MLE can be inconsistent; quasi-MLE is a robust alternative that still provides consistent estimates for certain parameters, such as means.
Derive the MLE estimator for the mean of a normal distribution by hand — this makes the logic of maximizing the likelihood concrete before applying it to more complex models like logit.
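As a sketch of that hand derivation, assume y₁, …, yₙ are independent draws from a normal distribution with unknown mean μ and variance σ² (the symbols μ, σ², and ȳ are notation introduced here for illustration). Write down the log-likelihood, differentiate with respect to μ, and set the derivative to zero:

```latex
\begin{align*}
\ell(\mu, \sigma^2)
  &= \sum_{i=1}^{n} \log f(y_i; \mu, \sigma^2)
   = -\frac{n}{2}\log(2\pi\sigma^2)
     - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2, \\
\frac{\partial \ell}{\partial \mu}
  &= \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \mu) = 0
  \quad\Longrightarrow\quad
  \hat{\mu}_{\mathrm{MLE}} = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y}, \\
\frac{\partial^2 \ell}{\partial \mu^2}
  &= -\frac{n}{\sigma^2} < 0 .
\end{align*}
```

The negative second derivative confirms that the sample mean is a maximum, and the solution does not depend on the value of σ².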
Maximum likelihood estimation asks a deceptively simple question: given the data I observed, what parameter values would have made this data most likely to occur? If you flip a coin 10 times and get 7 heads, the MLE for the probability of heads is 0.7 — the value that assigns the highest probability to the outcome "7 heads in 10 flips." The same logic extends to any parametric model: specify a distribution for the data, write down the probability of observing your sample as a function of the parameters, and then find the parameters that maximize it.
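The coin example can be checked numerically. The following is a minimal sketch, not a general-purpose estimator: it evaluates the binomial log-likelihood on a grid of candidate probabilities and picks the largest. The function name and the grid spacing are illustrative choices.

```python
import math

def binomial_log_likelihood(p, heads, flips):
    """Log-probability of `heads` successes in `flips` independent tosses."""
    log_choose = math.log(math.comb(flips, heads))
    return log_choose + heads * math.log(p) + (flips - heads) * math.log(1 - p)

heads, flips = 7, 10

# Candidate values 0.01, 0.02, ..., 0.99 (endpoints excluded to avoid log(0)).
grid = [k / 100 for k in range(1, 100)]

# The grid point with the highest log-likelihood approximates the MLE.
p_hat = max(grid, key=lambda p: binomial_log_likelihood(p, heads, flips))

print(p_hat)          # grid maximizer
print(heads / flips)  # closed-form MLE: the sample proportion
```

The grid maximizer coincides with the sample proportion, which is what setting the derivative of the log-likelihood to zero gives analytically.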
In practice, we work with the log-likelihood rather than the likelihood itself. Because observations are assumed independent, the likelihood is a product of n terms. For discrete data each term is a probability between 0 and 1; for continuous data each term is a density value, which is non-negative but can exceed 1. Either way, the product tends to become extremely small (or occasionally extremely large) for large n, so it is prone to numerical underflow or overflow. Taking the log converts the product to a sum, ℓ(θ) = Σᵢ log f(yᵢ; θ), which is much easier to work with analytically and numerically. Since log is a strictly increasing function, the θ that maximizes ℓ(θ) also maximizes L(θ), so nothing is lost.
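The numerical problem is easy to reproduce. The sketch below assumes a simulated sample of 2,000 standard normal draws (the sample size, seed, and helper name are arbitrary illustrative choices) and compares the raw likelihood with the log-likelihood.

```python
import math
import random

random.seed(0)

def normal_pdf(y, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and std. dev. sigma."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Simulated data, used only for illustration.
sample = [random.gauss(0.0, 1.0) for _ in range(2000)]

# Raw likelihood: a product of 2,000 density values, each below 0.4.
likelihood = 1.0
for y in sample:
    likelihood *= normal_pdf(y)

# Log-likelihood: the same information expressed as a sum of logs.
log_likelihood = sum(math.log(normal_pdf(y)) for y in sample)

print(likelihood)      # underflows to 0.0 in double precision
print(log_likelihood)  # a finite negative number
```

Once the product has underflowed to zero, every candidate θ looks equally bad, so an optimizer has nothing to climb. The sum of logs stays finite and keeps the ranking of parameter values intact.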
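One result you should know cold: for the normal linear regression model, the MLE and OLS estimates of β are identical. Plugging the normal density into the log-likelihood and maximizing with respect to β reduces algebraically to minimizing the sum of squared residuals, the same criterion OLS uses. (The two differ only in the error variance: the MLE divides the sum of squared residuals by n, while the usual unbiased estimator divides by n − k, where k is the number of estimated coefficients.) This equivalence does not mean that OLS requires normality; OLS is derived without any distributional assumption. What it shows is that OLS can be interpreted as the maximum likelihood estimator when the errors are normal, so it inherits the efficiency of MLE in that case. In non-linear models like logit or Poisson regression, where OLS does not directly apply, MLE becomes the standard estimation approach.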
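The equivalence for β can be illustrated numerically. This is a rough sketch on simulated data from a hypothetical model y = 2 + 0.5x + u with standard normal errors; all names and numbers are assumptions made for the example. It computes OLS in closed form and then checks that moving the coefficients away from the OLS values lowers the normal log-likelihood.

```python
import math
import random

random.seed(1)

# Simulated data from a hypothetical model: y = 2 + 0.5 x + u, u ~ N(0, 1).
n = 200
x = [random.uniform(0, 10) for _ in range(n)]
y = [2.0 + 0.5 * xi + random.gauss(0, 1) for xi in x]

def ssr(b0, b1):
    """Sum of squared residuals: the OLS criterion."""
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

def log_likelihood(b0, b1, sigma):
    """Normal log-likelihood: -(n/2) log(2 pi sigma^2) - SSR / (2 sigma^2)."""
    return (-0.5 * n * math.log(2 * math.pi * sigma ** 2)
            - ssr(b0, b1) / (2 * sigma ** 2))

# OLS in closed form for a single regressor.
x_bar, y_bar = sum(x) / n, sum(y) / n
b1_ols = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
b0_ols = y_bar - b1_ols * x_bar

# For a fixed sigma, beta enters the log-likelihood only through SSR,
# so any move away from the OLS values should lower the log-likelihood.
sigma = 1.0
best = log_likelihood(b0_ols, b1_ols, sigma)
shifts = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05), (0.1, -0.05)]
ols_is_best = all(
    log_likelihood(b0_ols + d0, b1_ols + d1, sigma) < best for d0, d1 in shifts
)

# The ML estimate of the error variance divides SSR by n.
sigma2_mle = ssr(b0_ols, b1_ols) / n

print(b0_ols, b1_ols, sigma2_mle)
print(ols_is_best)
```

Checking a handful of perturbations is not a proof, but it mirrors the algebra: with σ held fixed, maximizing the log-likelihood over β is the same problem as minimizing the sum of squared residuals.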
Maximum likelihood estimators have attractive large-sample (asymptotic) properties: they are consistent (converge in probability to the true parameter as n → ∞), asymptotically normal, and asymptotically efficient, meaning their asymptotic variance attains the Cramér-Rao lower bound, the smallest variance any unbiased estimator can have. These properties, however, generally depend on the model being correctly specified. If the assumed distribution does not match the true data-generating process, the estimator may converge to the wrong value entirely (inconsistency). This is the sharpest difference between MLE and OLS for linear regression: OLS only needs E[u|x] = 0 for consistency, while MLE in general needs the full distributional form to be right.
In small samples, MLE can be biased even when the model is correctly specified. The classic example is the variance of a normal distribution: the MLE divides by n rather than n−1, yielding a slightly downward-biased estimate. This finite-sample bias typically shrinks as n grows, but it is a reminder that the asymptotic efficiency of MLE does not mean it is always the best choice in small data settings.
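A small simulation makes the bias visible. The sketch below assumes samples of size n = 5 from a normal distribution with variance 4; the sample size, number of replications, seed, and helper name are arbitrary choices for illustration. It averages the two variance estimators over many replications.

```python
import random

random.seed(42)

true_var = 4.0   # hypothetical population variance (sigma = 2)
n = 5            # deliberately small sample size
reps = 20000     # number of simulated samples

def variance(sample, ddof):
    """Sum of squared deviations divided by (len(sample) - ddof)."""
    mean = sum(sample) / len(sample)
    return sum((v - mean) ** 2 for v in sample) / (len(sample) - ddof)

mle_estimates, unbiased_estimates = [], []
for _ in range(reps):
    sample = [random.gauss(0.0, 2.0) for _ in range(n)]
    mle_estimates.append(variance(sample, ddof=0))       # divides by n
    unbiased_estimates.append(variance(sample, ddof=1))  # divides by n - 1

mean_mle = sum(mle_estimates) / reps
mean_unbiased = sum(unbiased_estimates) / reps

print(mean_mle)       # should be near (n - 1) / n * true_var = 3.2
print(mean_unbiased)  # should be near true_var = 4.0
```

The expected value of the MLE is (n − 1)/n times the true variance, so the shortfall is 20% at n = 5 but only 1% at n = 100.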