In realizable PAC learning, sample complexity scales as O(d/epsilon). In agnostic PAC learning, it scales as O(d/epsilon^2). Why does removing the realizability assumption make learning harder?
AWithout realizability, the learning algorithm must use gradient descent instead of ERM, which converges more slowly
BIn the realizable case, any hypothesis with nonzero training error can be eliminated, drastically pruning the search space; in the agnostic case, every hypothesis has some error, so the learner must estimate error rates precisely — and estimating rates to epsilon accuracy requires O(1/epsilon^2) samples
CThe VC dimension effectively doubles in the agnostic setting because the hypothesis class must represent both the target and the noise
DAgnostic learning requires a separate validation set, consuming half the samples
In the realizable setting, the true target has zero error, so any hypothesis with even one training mistake is known to be imperfect — you can discard it with a single counterexample. This makes the problem combinatorial rather than statistical. In the agnostic setting, every hypothesis (including the best one) has nonzero error. The learner must distinguish between hypotheses with, say, 10% error and 10% + epsilon error, which requires estimating error rates to within epsilon. By Hoeffding's inequality, estimating a probability to within epsilon accuracy requires O(1/epsilon^2) samples. The extra 1/epsilon factor compared to the realizable case is this estimation cost.
Question 2 True / False
Agnostic PAC learning guarantees that the learned hypothesis achieves error within epsilon of the Bayes-optimal classifier.
TTrue
FFalse
Answer: False
Agnostic PAC learning guarantees error within epsilon of the best hypothesis IN THE CLASS, not the Bayes-optimal classifier. If the hypothesis class is limited (e.g., linear classifiers for a non-linear problem), the best-in-class hypothesis may be far from Bayes-optimal. The epsilon guarantee is relative to min_{h in H} R(h), not min over all measurable functions. This is the approximation error — a property of the class, not the algorithm. To get close to Bayes-optimal, you need both a rich enough class (low approximation error) and enough data (low estimation error).
Question 3 True / False
In the agnostic setting, ERM is still a valid learning algorithm — but it requires more samples than in the realizable case.
TTrue
FFalse
Answer: True
ERM remains the natural algorithm: find the hypothesis minimizing training error. In the agnostic setting, uniform convergence still guarantees that training error approximates true error for all hypotheses simultaneously, so the training-error minimizer is close to the true-risk minimizer. The difference is quantitative: the sample complexity increases from O(d/epsilon) to O(d/epsilon^2) because the learner must estimate error rates rather than just detect the presence or absence of errors. The algorithm is the same; the sample requirement is higher because the statistical task is harder.
Question 4 Short Answer
Explain why the agnostic setting is more practically relevant than the realizable setting, and what the 'price of agnosticism' is in terms of sample complexity.
Think about your answer, then reveal below.
Model answer: In practice, no hypothesis class perfectly captures the true data-generating process — there is always model misspecification, label noise, or irreducible error. The realizable assumption (that the target is in the class) is almost never true. Agnostic PAC learning drops this assumption, making the framework applicable to real problems. The price is a quadratic blowup in sample complexity: from O(d/epsilon) to O(d/epsilon^2), where d is the VC dimension. This extra 1/epsilon factor arises because the learner can no longer use the shortcut of eliminating hypotheses with training errors — instead, it must precisely estimate error rates to find the best-in-class hypothesis. The guarantee also weakens from 'within epsilon of zero error' to 'within epsilon of the best achievable error in the class,' making the approximation error of the class a separate concern.
The agnostic setting is also more natural for thinking about model selection: comparing two hypothesis classes means comparing their best achievable risks, and the agnostic framework makes this comparison precise.