A statistician writes L(θ) = ∏ p(xᵢ|θ) after observing data x₁, ..., xₙ. Which statement correctly describes what L(θ) is?
A. A probability distribution over possible parameter values: the probability that θ takes each value given the data
B. A measure of how probable the observed data would be for each candidate value of θ, with the data held fixed
C. The marginal probability of the data summed over all possible parameter values
D. A probability distribution over possible datasets for a fixed value of θ
Answer: B
The likelihood function is not a distribution over θ: it does not represent the probability that θ has a particular value, and it does not generally integrate to 1 over θ. It is the joint probability (or density) of the observed data, re-read as a function of the parameter with the data held constant. The same expression can mean two different things: p(x|θ) is a probability over data for fixed θ; L(θ) = p(x|θ) is a function of θ for fixed data. Confusing the two is a common conceptual error when learning MLE.
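As a rough illustration of the two readings, the sketch below assumes a binomial coin-flip model with 10 flips, which the question itself does not specify. The helper `binom_pmf` and all variable names are hypothetical. It evaluates the same formula twice: once with θ fixed and the data varying, and once with the data fixed and θ varying.

```python
from math import comb

def binom_pmf(k: int, n: int, theta: float) -> float:
    """Probability of k heads in n flips when P(heads) = theta (assumed model)."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)

n = 10

# Reading 1: fix theta, vary the data. This is a probability distribution over k,
# so the values over all possible outcomes sum to 1.
theta_fixed = 0.7
total_over_data = sum(binom_pmf(k, n, theta_fixed) for k in range(n + 1))

# Reading 2: fix the data (k = 7), vary theta. This is the likelihood function.
# Nothing forces these values to sum or integrate to 1.
k_observed = 7
thetas = [i / 10 for i in range(11)]
likelihood = [binom_pmf(k_observed, n, t) for t in thetas]

print(f"sum over all k at theta = {theta_fixed}: {total_over_data:.6f}")
for t, value in zip(thetas, likelihood):
    print(f"L({t:.1f}) = {value:.4f}")
```

Only the first reading is constrained to total 1; the second is just a curve over θ whose peak marks the best-supported parameter value.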
Question 2 Multiple Choice
You flip a coin 10 times and observe 7 heads. What does MLE give as the estimate of the probability of heads?
A. 0.5: a fair coin is the most principled default assumption
B. 0.7: this is the parameter value that makes observing exactly 7 heads in 10 flips most probable
C. It cannot be determined without specifying a prior distribution over the probability of heads
D. 0.7 if the coin is known to be biased; 0.5 if the coin is assumed fair
Answer: B
MLE finds the θ̂ that maximizes L(θ) = C(10,7) θ⁷(1−θ)³. The log-likelihood is ℓ(θ) = log C(10,7) + 7 log θ + 3 log(1−θ); setting its derivative 7/θ − 3/(1−θ) to zero gives 7(1−θ) = 3θ, so θ̂ = 7/10 = 0.7. MLE makes no use of prior beliefs about whether the coin 'should' be fair. It answers only one question: which θ makes the data you observed most probable? A prior distribution is a Bayesian concept, not part of MLE.
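The closed-form result can be cross-checked numerically. The following is a minimal sketch, assuming a simple grid search over θ with a step of 0.001; the function name `log_likelihood` and the grid resolution are illustrative choices, not part of the original question.

```python
from math import comb, log

def log_likelihood(theta: float, heads: int = 7, flips: int = 10) -> float:
    """Binomial log-likelihood of observing `heads` in `flips` at a given theta."""
    return (
        log(comb(flips, heads))
        + heads * log(theta)
        + (flips - heads) * log(1 - theta)
    )

# Candidate values strictly inside (0, 1) so that log() is always defined.
grid = [i / 1000 for i in range(1, 1000)]

# The MLE on this grid is the candidate with the highest log-likelihood.
theta_hat = max(grid, key=log_likelihood)

closed_form = 7 / 10
print(f"grid-search estimate: {theta_hat}")
print(f"closed-form estimate: {closed_form}")
```

The grid search never consults a prior or a notion of fairness; it only ranks candidate values of θ by how probable they make the observed 7 heads.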
Question 3 True / False
The likelihood function L(θ) is a probability distribution over the parameter θ and therefore integrates (or sums) to 1 over all possible values of θ.
T. True
F. False
Answer: False
The likelihood function is not a probability distribution over θ. It does not generally integrate to 1 over θ and has no probabilistic interpretation as a distribution over parameter values. It is a function measuring the compatibility of the observed data with each value of θ. To get a proper distribution over θ you need a prior, which Bayesian inference supplies and MLE does not use.
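A quick numerical check makes the point concrete. The sketch below assumes the coin example from Question 2 (7 heads in 10 flips) and a plain midpoint rule for the integral; the function name and step count are illustrative assumptions.

```python
from math import comb

def likelihood(theta: float, heads: int = 7, flips: int = 10) -> float:
    """Binomial likelihood of theta given `heads` observed in `flips`."""
    return comb(flips, heads) * theta**heads * (1 - theta) ** (flips - heads)

# Midpoint-rule approximation of the integral of L(theta) over [0, 1].
steps = 100_000
width = 1.0 / steps
area = sum(likelihood((i + 0.5) * width) for i in range(steps)) * width

print(f"area under L(theta): {area:.6f}")
```

For this model the area works out to 1/11 rather than 1, so the likelihood curve cannot be read directly as a probability density over θ.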
Question 4 True / False
Maximizing the log-likelihood ℓ(θ) = Σ log p(xᵢ|θ) gives the same θ̂ as maximizing the likelihood L(θ) = ∏ p(xᵢ|θ).
T. True
F. False
Answer: True
The logarithm is strictly increasing, so it preserves the location of the maximum: the θ that maximizes L(θ) is the same θ that maximizes log L(θ). The log-likelihood is preferred in practice because it converts products into sums (easier to differentiate) and avoids numerical underflow from multiplying many small probabilities. The mathematical result is identical.
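Both claims can be seen in a small experiment. This is a sketch under stated assumptions: the data are simulated coin flips with a hypothetical true P(heads) of 0.7, the seed and sample sizes are arbitrary, and the candidate values of θ come from a coarse grid.

```python
import math
import random

random.seed(0)
# Hypothetical data: 5000 simulated flips of a coin with P(heads) = 0.7.
flips = [1 if random.random() < 0.7 else 0 for _ in range(5000)]

def likelihood(theta, data):
    """Raw product of per-flip probabilities."""
    product = 1.0
    for x in data:
        product *= theta if x == 1 else 1.0 - theta
    return product

def log_likelihood(theta, data):
    """Sum of per-flip log-probabilities."""
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in data)

grid = [i / 100 for i in range(1, 100)]

# On a small sample both objectives are well behaved and share a maximizer.
small = flips[:20]
theta_from_product = max(grid, key=lambda t: likelihood(t, small))
theta_from_log = max(grid, key=lambda t: log_likelihood(t, small))

# On the full sample the raw product underflows to 0.0 at every grid point,
# so only the log-likelihood can still rank the candidates.
raw_values = [likelihood(t, flips) for t in grid]
theta_hat = max(grid, key=lambda t: log_likelihood(t, flips))

print("small sample:", theta_from_product, theta_from_log)
print("largest raw likelihood on full sample:", max(raw_values))
print("log-likelihood estimate on full sample:", theta_hat)
```

With 20 flips the two objectives pick the same θ; with 5000 flips the product is too small for a double-precision float, while the sum of logs stays finite.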
Question 5 Short Answer
What is the central question MLE asks, and how does it differ from the question that a probability mass or density function answers?
Model answer: A PMF/PDF answers: given this parameter value θ, how probable is this outcome? MLE inverts the question: given the observed data, which parameter value θ makes that data most probable? The PMF treats θ as fixed and data as variable; the likelihood function treats the observed data as fixed and θ as the variable to optimize over. MLE finds the θ that would have made the data you actually saw the least surprising.
This inversion is conceptually subtle. p(x|θ) and L(θ) = p(x|θ) are numerically the same expression but ask different questions. Failing to see the difference leads to treating the likelihood as a probability over θ. MLE is a frequentist procedure: it finds the best-fit parameter but makes no probability claims about where the true θ lies — that is the province of Bayesian inference.