You observe 3 successes in 10 Bernoulli trials. What is the MLE for the success probability p, and why?
A. p̂ = 0.5, because we have no prior reason to prefer any other value
B. p̂ = 0.3, because it maximizes the likelihood of observing exactly 3 successes in 10 trials
C. p̂ = 0.5, because the MLE for Bernoulli trials always equals 0.5 by symmetry
D. p̂ = 0.3, because it is always the unbiased estimator of p
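Answer: B
The MLE picks the parameter value that makes the observed data most probable. Up to a binomial coefficient that does not depend on p, the likelihood is L(p) = p³(1−p)⁷, so log L(p) = 3 log p + 7 log(1−p). Setting d/dp[log L] = 3/p − 7/(1−p) = 0 gives 3(1−p) = 7p, so p̂ = 3/10 = 0.3. This is the p for which observing exactly 3 successes in 10 trials is most likely. Options A and C are wrong: 0.5 would be the MLE only if you had observed 5 successes in 10 trials, and the MLE depends on the data, not on a prior belief or a symmetry argument. Option D confuses the MLE with unbiasedness. The sample proportion happens to be unbiased here, but the reason the MLE is 0.3 is that it maximizes the likelihood, not that it is unbiased.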
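The calculus result can be checked numerically. The sketch below is only an illustration, not part of the original quiz: it evaluates the log-likelihood on a grid of candidate values and picks the largest. The function name `log_likelihood` and the grid resolution are assumptions made for this example.

```python
import math

def log_likelihood(p, successes=3, trials=10):
    """Bernoulli log-likelihood, dropping the constant binomial coefficient."""
    return successes * math.log(p) + (trials - successes) * math.log(1 - p)

# Candidate values strictly inside (0, 1), in steps of 0.001.
grid = [i / 1000 for i in range(1, 1000)]

# The grid point with the highest log-likelihood approximates the MLE.
p_hat = max(grid, key=log_likelihood)

print("grid-search MLE:", p_hat)
print("log L(0.3) > log L(0.5):", log_likelihood(0.3) > log_likelihood(0.5))
```

A grid search is crude, but it makes the point that 0.3 wins because of the likelihood values themselves, not because of any property such as unbiasedness.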
Question 2 Multiple Choice
A researcher computes the MLE for the variance σ² of a normal distribution with unknown mean and obtains σ̂² = (1/n)Σ(xᵢ − x̄)². Which statement is correct?
A. This estimator is unbiased, because MLEs are always unbiased
B. This estimator is biased: dividing by n rather than n−1 underestimates the true variance for finite samples
C. This estimator is efficient, so it must also be unbiased
D. The bias is irrelevant because MLE only guarantees asymptotic properties
Answer: B
The MLE for the normal variance, (1/n)Σ(xᵢ − x̄)², has expectation (n−1)σ²/n, so it systematically underestimates the true variance for any finite n. This is a concrete counterexample to the misconception that MLEs are always unbiased. Under standard regularity conditions MLEs are asymptotically unbiased (the bias vanishes as n → ∞), but they can be biased in finite samples. The unbiased estimator S² = (1/(n−1))Σ(xᵢ − x̄)² corrects for this. Option C confuses efficiency (attaining the minimum asymptotic variance) with unbiasedness; these are separate properties.
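A small simulation can make the bias visible. This is a rough sketch under assumed settings (n = 5, σ = 2, 20,000 repetitions, a fixed seed), none of which come from the quiz. It averages both estimators over many samples and compares the averages with (n−1)σ²/n and σ².

```python
import random

def variance_estimates(sample):
    """Return (MLE with divisor n, unbiased estimator with divisor n - 1)."""
    n = len(sample)
    mean = sum(sample) / n
    sum_sq = sum((x - mean) ** 2 for x in sample)
    return sum_sq / n, sum_sq / (n - 1)

random.seed(0)                      # assumed seed, for reproducibility only
n, sigma, reps = 5, 2.0, 20000      # assumed simulation settings

mle_total = 0.0
unbiased_total = 0.0
for _ in range(reps):
    sample = [random.gauss(0.0, sigma) for _ in range(n)]
    mle, unbiased = variance_estimates(sample)
    mle_total += mle
    unbiased_total += unbiased

mle_mean = mle_total / reps
unbiased_mean = unbiased_total / reps

print("average MLE:     ", mle_mean, " theory:", (n - 1) / n * sigma ** 2)
print("average unbiased:", unbiased_mean, " theory:", sigma ** 2)
```

With a sample size this small the gap is large (a factor of 4/5); it shrinks as n grows, which is what "asymptotically unbiased" means.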
Question 3 True / False
The MLE usually produces a closed-form solution that can be computed analytically from a formula.
T. True
F. False
Answer: False
Many MLEs require numerical optimization. Logistic regression, mixture models, and neural networks all rely on iterative algorithms (gradient descent, Newton-Raphson, the EM algorithm) to maximize the log-likelihood. Closed-form solutions exist for standard families like the normal, exponential, and binomial, but these are the exception rather than the rule in applied statistics.
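As an illustration of an MLE with no closed form, here is a minimal Newton-Raphson sketch for a one-parameter logistic regression, P(y = 1 | x) = 1/(1 + exp(−βx)), with no intercept. The toy data, the function names, and the stopping rule are all assumptions made for this example. The data were chosen so that the classes overlap, since otherwise the MLE would not be finite.

```python
import math

# Hypothetical toy data; the classes overlap, so the MLE is finite.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score_and_information(beta):
    """First derivative of the log-likelihood and the negative second derivative."""
    score = sum(x * (y - sigmoid(beta * x)) for x, y in zip(xs, ys))
    info = sum(x * x * sigmoid(beta * x) * (1.0 - sigmoid(beta * x)) for x in xs)
    return score, info

def newton_raphson(beta=0.0, tol=1e-10, max_iter=50):
    """Iterate beta <- beta + score / information until the step is tiny."""
    for _ in range(max_iter):
        score, info = score_and_information(beta)
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta

beta_hat = newton_raphson()
print("estimated slope:", beta_hat)
print("score at estimate:", score_and_information(beta_hat)[0])
```

The score equation Σ xᵢ(yᵢ − σ(βxᵢ)) = 0 cannot be solved for β algebraically, so the estimate is found by repeated updates instead of a formula.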
Question 4 True / False
A large Fisher information value I(θ) implies the MLE will have high variance and be a poor estimator of θ.
T. True
F. False
Answer: False
This is backwards. Large Fisher information means the data is highly informative about θ: the log-likelihood is sharply peaked around the true value, and the MLE concentrates tightly around the truth. The asymptotic variance of the MLE is I(θ)⁻¹, where I(θ) is the information in the full sample (n times the per-observation information for i.i.d. data), so large I(θ) means small variance and a precise estimator. Low Fisher information means the likelihood is flat and the data is uninformative, leading to a high-variance MLE.
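The inverse relationship can be checked for the Bernoulli model, where the per-observation information is I(p) = 1/(p(1−p)) and the variance of p̂ from n trials is 1/(n·I(p)) = p(1−p)/n. The sketch below is illustrative only; the choices n = 50, 5,000 repetitions, the seed, and the helper names are assumptions.

```python
import random

def fisher_information(p):
    """Per-observation Fisher information for a Bernoulli(p) trial."""
    return 1.0 / (p * (1.0 - p))

def simulated_mle_variance(p, n, reps, rng):
    """Empirical variance of the MLE (sample proportion) over repeated samples."""
    estimates = []
    for _ in range(reps):
        successes = sum(rng.random() < p for _ in range(n))
        estimates.append(successes / n)
    mean = sum(estimates) / reps
    return sum((e - mean) ** 2 for e in estimates) / reps

rng = random.Random(0)   # assumed seed, for reproducibility only
n, reps = 50, 5000       # assumed simulation settings

results = {}
for p in (0.5, 0.1):
    predicted = 1.0 / (n * fisher_information(p))
    observed = simulated_mle_variance(p, n, reps, rng)
    results[p] = (predicted, observed)
    print("p =", p, " I(p) =", fisher_information(p),
          " predicted var =", predicted, " simulated var =", observed)
```

The information is higher at p = 0.1 than at p = 0.5, and the simulated variance of the MLE is correspondingly lower there.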
Question 5 Short Answer
What does it mean to say the MLE is 'the parameter value that makes the observed data most probable,' and why do we maximize the log-likelihood rather than the likelihood itself?
Model answer: The likelihood function L(θ|X) gives the probability (or density) of the observed data X for each candidate value of θ. The MLE θ̂ is the value of θ that maximizes this function — making the observed outcome as probable as possible under the assumed model. We maximize the log-likelihood because the log converts the product ∏f(xᵢ|θ) into a sum Σlog f(xᵢ|θ), which is easier to differentiate and numerically more stable. Since log is monotonically increasing, the maximizer of log L is identical to the maximizer of L.
The log transformation is one of the most powerful computational tricks in statistics. It converts products to sums, which are much easier to differentiate, and it prevents floating-point underflow when n is large. The score equation ∂ℓ/∂θ = 0 is often analytically tractable when the corresponding likelihood derivative would be algebraically complex. The invariance of the maximizer under monotone transformations is the mathematical justification. Because the log-likelihood of independent observations is a sum, asymptotic theory is also developed in terms of the log-likelihood and its curvature (Fisher information).
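The underflow point is easy to demonstrate. In this sketch (an illustration with an assumed sample size of 2,000 standard-normal draws and a hand-written `normal_pdf` helper), the raw likelihood is a product of 2,000 densities, each below 0.4, so it falls below the smallest representable double and collapses to zero. The log-likelihood stays an ordinary finite number.

```python
import math
import random

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and std dev sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

random.seed(0)  # assumed seed, for reproducibility only
data = [random.gauss(0.0, 1.0) for _ in range(2000)]

# Raw likelihood: a product of many numbers below 1 underflows to 0.0.
likelihood = 1.0
for x in data:
    likelihood *= normal_pdf(x)

# Log-likelihood: a sum of moderately sized negative numbers.
log_likelihood = sum(math.log(normal_pdf(x)) for x in data)

print("likelihood:    ", likelihood)
print("log-likelihood:", log_likelihood)
```

Once the product has underflowed, every candidate θ gives the same value of zero and the maximizer can no longer be located, which is why numerical optimizers work with the sum of logs.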