A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Maximum Likelihood Estimation (Theory)

College · Depth 109 in the knowledge graph

32 topics build on this

722 prerequisites beneath it

Core Idea

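The maximum likelihood estimator (MLE) θ̂ₙ maximizes the likelihood L(θ|X) = ∏ᵢ f(Xᵢ|θ) of i.i.d. observations X₁, …, Xₙ. Under regularity conditions, MLEs have desirable asymptotic properties: consistency, asymptotic normality, and efficiency (achieving the Cramér-Rao bound asymptotically). When the maximum lies in the interior of the parameter space and the log-likelihood is differentiable, θ̂ₙ solves the score equation ∂log L/∂θ = 0. A solution of the score equation need not be unique; uniqueness requires extra structure, such as a strictly concave log-likelihood.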

How It's Best Learned

Compute MLEs for standard families (normal, exponential, binomial). Verify regularity conditions. Apply the asymptotic normality result to construct confidence intervals.

Common Misconceptions

- Thinking MLEs are always unbiased; MLEs can be biased for finite samples.
- Assuming the MLE always has a closed form; many MLEs require numerical optimization.
- Forgetting that asymptotic normality requires regularity conditions.
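The first misconception can be checked with a small simulation. The sketch below assumes a normal model with unknown mean and variance, where the MLE of σ² divides by n rather than n − 1 and has expected value (n − 1)/n · σ². The sample size, replication count, and seed are arbitrary illustrative choices, not values from this page.

```python
import random

def average_variance_mle(n, reps, sigma=1.0, seed=0):
    """Average the normal-variance MLE (divides by n, not n - 1) over simulated samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(xs) / n
        total += sum((x - mean) ** 2 for x in xs) / n  # MLE of sigma**2
    return total / reps

# Illustrative settings: tiny samples make the bias easy to see.
avg = average_variance_mle(n=5, reps=20000)
print("average MLE of variance:", avg)
print("theoretical expectation (n - 1) / n * sigma**2:", 4 / 5)
```

With n = 5 and σ² = 1 the average should land near 0.8 rather than 1, and the gap shrinks as n grows.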

Explainer

Maximum likelihood estimation formalizes a natural intuition: given observed data, choose the parameter value that makes the data most probable. Suppose you flip a coin 10 times and get 7 heads. You don't know the coin's bias p. The likelihood function L(p | data) = C(10, 7) · p⁷(1 − p)³, where C(10, 7) = 120 is the binomial coefficient, tells you how probable the observed outcome (7 heads in 10 flips) would be for each candidate value of p: L(0.5) ≈ 0.117, L(0.7) ≈ 0.267, L(0.9) ≈ 0.057. The coefficient does not depend on p, so it can be dropped when maximizing. Of these candidates, p = 0.7 makes the data most probable, and indeed maximizing the likelihood analytically (by setting its derivative to zero) gives p̂ = 7/10 = 0.7. The MLE is the parameter value that best "explains" the data you actually observed.
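The coin example can be reproduced numerically. This is a minimal sketch using only the standard library; the function name and the grid resolution are illustrative choices. The likelihood includes the binomial coefficient C(10, 7) = 120, which is constant in p and therefore does not move the maximizer.

```python
from math import comb

def binomial_likelihood(p, heads=7, flips=10):
    """Probability of observing `heads` heads in `flips` flips when P(heads) = p."""
    return comb(flips, heads) * p**heads * (1 - p) ** (flips - heads)

# Evaluate the likelihood at the three candidate values from the text.
for p in (0.5, 0.7, 0.9):
    print(f"L({p}) = {binomial_likelihood(p):.3f}")

# Crude grid search over p in (0, 1); the maximizer should sit at 7/10.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=binomial_likelihood)
print(f"grid maximizer: {p_hat}")
```

A grid search is only a sanity check here; the analytic maximizer is available in closed form, and finer problems call for a proper optimizer.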

In practice, you maximize the log-likelihood ℓ(θ) = log L(θ | X) = Σᵢ log f(Xᵢ | θ) rather than the likelihood itself. Logs convert products to sums, which are easier to differentiate, and since log is monotonically increasing, the maximizer doesn't change. Setting the score ∂ℓ/∂θ to zero and solving gives the MLE whenever the maximum is an interior stationary point. For the Gaussian N(μ, σ²) with known variance, the log-likelihood is −Σᵢ(xᵢ − μ)²/(2σ²) plus a constant; differentiating with respect to μ gives Σᵢ(xᵢ − μ)/σ² = 0, which yields μ̂ = x̄, the sample mean. For the exponential distribution with rate λ, ℓ(λ) = n log λ − λ Σᵢ xᵢ, so the MLE is λ̂ = 1/x̄. These closed-form solutions are convenient, but many models (logistic regression, mixture models) require numerical optimization of the log-likelihood, which is where your prerequisite optimization knowledge applies directly.
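To illustrate what numerical optimization of a log-likelihood looks like, the sketch below solves the exponential score equation n/λ − Σᵢ xᵢ = 0 by Newton-Raphson and compares the result with the closed form 1/x̄. The sample, the starting value, and the function names are hypothetical; the exponential model is used only because the exact answer is known, so the numerical result can be verified.

```python
def exponential_score(lam, xs):
    """First derivative of the log-likelihood n*log(lam) - lam*sum(xs)."""
    return len(xs) / lam - sum(xs)

def exponential_score_derivative(lam, xs):
    """Second derivative of the log-likelihood: -n / lam**2."""
    return -len(xs) / lam**2

def newton_mle(xs, lam0, tol=1e-10, max_iter=50):
    """Solve the score equation by Newton-Raphson, starting from lam0."""
    lam = lam0
    for _ in range(max_iter):
        step = exponential_score(lam, xs) / exponential_score_derivative(lam, xs)
        lam -= step
        if abs(step) < tol:
            break
    return lam

# Hypothetical sample with mean 1.0, so the closed-form MLE is 1 / 1.0 = 1.0.
sample = [0.5, 1.2, 0.3, 2.0, 1.0]
lam_closed_form = len(sample) / sum(sample)
lam_numeric = newton_mle(sample, lam0=0.5)
print(lam_closed_form, lam_numeric)
```

Newton-Raphson is sensitive to its starting point: for this model the iteration only converges from a start below 2/x̄, so real applications typically rely on a safeguarded optimizer instead.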

The asymptotic theory is what makes MLEs so valuable in large samples. Under regularity conditions (roughly: the model is identifiable, the true parameter lies in the interior of the parameter space, and differentiation can be interchanged with integration), the MLE θ̂ₙ based on n i.i.d. observations satisfies three properties. First, consistency: θ̂ₙ → θ₀ in probability as n → ∞. Second, asymptotic normality: √n(θ̂ₙ − θ₀) → N(0, I(θ₀)⁻¹) in distribution, where I(θ) = −E[∂² log f(X | θ)/∂θ²] is the Fisher information of a single observation. Third, efficiency: the asymptotic variance I(θ₀)⁻¹ equals the Cramér-Rao lower bound, so no regular estimator has a smaller asymptotic variance.
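The asymptotic normality claim can be probed by simulation. The sketch below assumes an exponential model with rate λ₀, for which the per-observation Fisher information is I(λ) = 1/λ², so the limiting variance of √n(λ̂ₙ − λ₀) is λ₀². The true rate, sample size, replication count, and seed are arbitrary illustrative choices.

```python
import math
import random
import statistics

def simulate_scaled_errors(lam0, n, reps, seed=0):
    """Return sqrt(n) * (lam_hat - lam0) for `reps` simulated exponential samples."""
    rng = random.Random(seed)
    errors = []
    for _ in range(reps):
        total = sum(rng.expovariate(lam0) for _ in range(n))
        lam_hat = n / total  # MLE of the rate: 1 / sample mean
        errors.append(math.sqrt(n) * (lam_hat - lam0))
    return errors

lam0 = 2.0  # hypothetical true rate
errors = simulate_scaled_errors(lam0, n=200, reps=4000)
print("empirical variance:", statistics.pvariance(errors))
print("theoretical 1 / I(lam0) = lam0**2:", lam0**2)
```

The empirical variance should be close to, but not exactly, λ₀² = 4; the remaining gap reflects both Monte Carlo noise and the finite sample size n = 200.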

The Fisher information deserves emphasis. It measures how much a single observation "tells you" about θ, that is, how sharply peaked the expected log-likelihood is around the true value. Large Fisher information means the data is highly informative, the MLE concentrates tightly around the truth, and you need fewer observations to estimate precisely. The asymptotic normality result lets you construct approximate 95% confidence intervals: θ̂ ± 1.96/√(n·I(θ̂)). This is the workhorse of likelihood-based inference: it is valid for any model satisfying the regularity conditions, without requiring the data itself to be normally distributed. The price is that these guarantees are asymptotic: for small samples, the MLE can be biased and its variance may not match the bound given by the Fisher information.
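As an illustration of the interval formula, the sketch below applies it to the coin example (7 heads in 10 flips). It assumes a Bernoulli model, whose per-observation Fisher information is I(p) = 1/(p(1 − p)); the function name is hypothetical. With only 10 observations the asymptotic approximation is rough, so this shows the mechanics of the formula rather than a recommended interval for such a small sample.

```python
import math

def wald_interval_bernoulli(successes, n, z=1.96):
    """Approximate interval p_hat +/- z / sqrt(n * I(p_hat)), with I(p) = 1 / (p * (1 - p))."""
    p_hat = successes / n
    # Per-observation Fisher information; undefined when p_hat is exactly 0 or 1.
    fisher_info = 1.0 / (p_hat * (1.0 - p_hat))
    half_width = z / math.sqrt(n * fisher_info)
    return p_hat - half_width, p_hat + half_width

low, high = wald_interval_bernoulli(7, 10)
print(f"({low:.3f}, {high:.3f})")  # prints (0.416, 0.984)
```

The width of the interval shrinks like 1/√n: keeping p̂ = 0.7 but using 1000 flips would cut the half-width by a factor of 10.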

Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Dividing Integers → Unit Rates → Proportions → Percent Concept → Converting Between Fractions, Decimals, and Percents → Operations with Rational Numbers → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → 
Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Sine, Cosine, and Tangent Ratios → Trigonometric Ratios Review → Radian Measure → Converting Between Degrees and Radians → The Unit Circle → Graphing Sine and Cosine → Graphing Tangent and Reciprocal Trigonometric Functions → Derivatives of Trigonometric Functions → Antiderivatives → Indefinite Integrals → Basic Integration Rules → Riemann Sums → Definite Integral Definition → Fundamental Theorem of Calculus Part 1 → Fundamental Theorem of Calculus Part 2 → U-Substitution → Partial Fraction Decomposition for Integration → Improper Integrals - Convergence → Integral Test → P-Series → Comparison Test → Limit Comparison Test → Series Convergence Test Strategy → Power Series → Radius and Interval of Convergence → Taylor Series → Moment Generating Functions → Characteristic Functions → Convergence in Distribution → Stationary Distributions → Convergence of Markov Chains → Convergence in Probability → Almost Sure Convergence → Relationships Between Modes of Convergence → Weak Law of Large Numbers → Strong Law of Large Numbers → Central Limit Theorem (Rigorous via Characteristic Functions) → Maximum Likelihood Estimation (Theory)

Longest path: 110 steps · 722 total prerequisite topics

Prerequisites (2)

Optimization Problems (soft)
Central Limit Theorem (Rigorous via Characteristic Functions) (soft)

Leads To (7)

Ability Parameter Estimation and Theta Estimation Methods (hard)
Asymptotic Normality of the MLE (hard)
Consistency of Estimators (soft)
Exponential Family of Distributions (soft)
Neyman-Pearson Lemma (hard)
Parameter Estimation in Biological Models (soft)
Two-Parameter Logistic IRT Model (2PL) (soft)