Why is uniform convergence (simultaneous over all hypotheses) required for ERM to work, rather than just convergence for each individual hypothesis?
AIndividual convergence is sufficient for ERM; uniform convergence is only needed for Bayesian methods
BERM selects the hypothesis with the lowest training error — a data-dependent choice. Without uniform convergence, training error might be a good estimate for most hypotheses but grossly underestimate the error of the specific hypothesis ERM selects
CUniform convergence is needed because ERM evaluates all hypotheses in parallel, requiring simultaneous accuracy
DIndividual convergence guarantees apply only to fixed hypotheses determined before seeing data, but ERM's choice depends on the data
This is the crux of the issue. For any FIXED hypothesis chosen before seeing the data, the law of large numbers ensures training error converges to true error. But ERM does not pick a fixed hypothesis — it searches the class for the one that looks best on the training data. This data-dependent selection biases the choice toward hypotheses that happen to perform unusually well on the specific training sample (possibly by fitting noise). Uniform convergence controls this by ensuring that NO hypothesis in the class has a large gap between training and true error — so even the one ERM selects must have training error close to its true error.
Question 2 Multiple Choice
Uniform convergence with rate epsilon means that for every hypothesis h in H, |training_error(h) - true_error(h)| <= epsilon. If a class has VC dimension 10 and you want epsilon = 0.05 with delta = 0.05, approximately how many samples are needed?
AAbout 200 samples — roughly d/epsilon
BAbout 4,000 samples — roughly d/epsilon^2, which for VC dimension 10 and epsilon = 0.05 gives 10/0.0025 = 4,000
CAbout 40,000 samples — roughly d^2/epsilon^2
DAbout 100 samples — roughly d * log(1/epsilon)
The standard uniform convergence bound for the agnostic setting gives sample complexity O((d + log(1/delta))/epsilon^2). For d = 10, epsilon = 0.05, delta = 0.05: roughly (10 + 3)/0.0025 ≈ 5,200. The O(d/epsilon^2) term dominates, giving approximately 4,000. The key scaling is linear in VC dimension and inverse-quadratic in epsilon. This quadratic dependence on 1/epsilon is the cost of estimating continuous error rates rather than detecting binary consistency (which would give 1/epsilon).
Question 3 True / False
Uniform convergence is both necessary and sufficient for PAC learnability in binary classification.
TTrue
FFalse
Answer: True
This is one of the key equivalences in the fundamental theorem of statistical learning. For binary classification with 0-1 loss, a hypothesis class is PAC-learnable if and only if it has the uniform convergence property. The forward direction (uniform convergence implies learnability) is straightforward: if training error uniformly approximates true error, ERM succeeds. The reverse direction (learnability implies uniform convergence) is deeper: it shows that any successful learning algorithm implicitly requires the class to have uniform convergence. This equivalence is specific to the binary classification setting — it does not hold for all loss functions or learning settings.
Question 4 True / False
Adding more hypotheses to a class can only make uniform convergence slower or unchanged — it can never speed it up.
TTrue
FFalse
Answer: True
Uniform convergence requires controlling the maximum gap between training and true error over ALL hypotheses. Adding hypotheses to the class creates more potential for some hypothesis to have a large gap by chance. Formally, the uniform convergence rate depends on the class complexity (VC dimension, Rademacher complexity), which can only increase or stay the same when the class is enlarged. A larger class has more 'degrees of freedom' for the training error to be misleading, requiring more data to ensure no hypothesis is badly misestimated.
Question 5 Short Answer
Explain the relationship between uniform convergence, the union bound, and the growth function in proving generalization bounds.
Think about your answer, then reveal below.
Model answer: The proof proceeds in three steps. First, for any single fixed hypothesis, concentration inequalities (like Hoeffding's) show that its training error is within epsilon of its true error with high probability. Second, to extend this to all hypotheses simultaneously, we need a union bound over the class — but the class may be infinite. Third, the growth function resolves this: on a sample of n points, the class induces at most Pi_H(n) distinct labelings, so only Pi_H(n) 'effective' hypotheses need to be controlled. The union bound over Pi_H(n) hypotheses gives P(any gap > epsilon) <= Pi_H(n) * exp(-2n*epsilon^2). When VC dimension is finite, the Sauer-Shelah lemma bounds Pi_H(n) as polynomial in n, making the union bound effective. The overall bound shows uniform convergence with sample complexity O((d*log(n) + log(1/delta))/epsilon^2).
The three ingredients — concentration for one hypothesis, growth function for counting effective hypotheses, union bound for combining — are the workhorses of classical learning theory. Every generalization bound in this course uses some variant of this template.