You train a linear classifier (VC dimension 3) and a degree-10 polynomial classifier (VC dimension 11) on 50 data points. The polynomial achieves lower training error but higher test error. How does the bias-complexity decomposition explain this?
AThe polynomial has lower approximation error (it can fit the target better) but much higher estimation error (VC dimension 11 with only 50 points means the ERM solution is unreliable), and the estimation error dominates
BThe polynomial has higher approximation error because degree-10 polynomials introduce systematic distortion
CBoth have the same approximation error, but the polynomial's training algorithm is less efficient
DThe polynomial has lower estimation error because it fits the training data better, but higher bias from the polynomial assumption
The degree-10 polynomial class contains the linear class as a special case, so its approximation error is at most as large — likely smaller, since it can represent more complex boundaries. However, with VC dimension 11 and only 50 samples, the estimation error bound is roughly sqrt(11/50), which is substantial. The linear classifier's estimation error is roughly sqrt(3/50), much smaller. Even though the polynomial class has a better 'best hypothesis,' the ERM procedure with limited data selects a hypothesis far from that best, because the larger class offers too many near-optimal-on-training hypotheses that differ widely on unseen data.
Question 2 True / False
In the bias-complexity decomposition, increasing the hypothesis class size ALWAYS reduces approximation error.
TTrue
FFalse
Answer: True
Approximation error is defined as the minimum true risk achievable by any hypothesis in the class: min_{h in H} R(h). If you enlarge the class from H to H' where H is a subset of H', the minimum can only decrease or stay the same — the minimizer over a larger set is at most as large as the minimizer over a subset. This is a purely mathematical property of optimization over nested sets. The tradeoff is that the estimation error increases because the larger class is harder to learn from finite data. But approximation error itself is monotonically non-increasing as the class grows.
Question 3 True / False
A model with zero approximation error and very high estimation error will generally perform worse than a model with moderate approximation error and low estimation error.
TTrue
FFalse
Answer: True
Total risk equals approximation error plus estimation error. A class with zero approximation error contains the true target function, but if it is so rich that estimation error dominates (the ERM hypothesis is far from the best in the class due to limited data), the total risk can be very high. A simpler class with some approximation error but tight estimation error (because ERM reliably finds near-optimal hypotheses in the smaller class) can achieve lower total risk. This is the formal version of the practical observation that simpler models often outperform complex ones with limited data.
Question 4 Short Answer
How does the bias-complexity tradeoff differ from the classical bias-variance tradeoff, and what does the formal version add?
Think about your answer, then reveal below.
Model answer: The classical bias-variance tradeoff is stated for a specific model and specific loss function (typically squared error) and decomposes expected test error into bias squared, variance, and irreducible noise. The formal bias-complexity tradeoff operates at the level of hypothesis classes, not individual models, and decomposes the risk of the ERM procedure into approximation error (a property of the class) and estimation error (a property of the class complexity and sample size). The formal version adds precision: approximation error is defined as min_{h in H} R(h), and estimation error is bounded using VC dimension or Rademacher complexity. This makes the tradeoff quantitative — you can compute or bound each term — and connects it to sample complexity theory. It also makes the resolution actionable via structural risk minimization: choose the class that minimizes the sum of approximation and estimation error bounds.
The bias-variance decomposition is a statistical identity; the bias-complexity decomposition is a learning-theoretic framework. The former applies to any estimator; the latter is specifically about ERM over hypothesis classes. Both capture the same fundamental tension but in different mathematical languages.