SRM chooses among hypothesis classes H_1, H_2, H_3 with VC dimensions 2, 5, 20. On 100 training examples, H_1 achieves 15% training error, H_2 achieves 4% training error, and H_3 achieves 0.5% training error. Which class would SRM likely select?
AH_3, because it has the lowest training error and the penalty term is negligible with 100 samples
BH_2, because it balances moderate training error with a manageable complexity penalty — the bound for H_3 adds a large penalty of roughly sqrt(20/100) ≈ 0.45, wiping out its training error advantage
CH_1, because SRM always prefers the simplest model regardless of training error
DIt depends entirely on the test set, which SRM does not use
SRM minimizes training error plus a complexity penalty. For H_1: 0.15 + C*sqrt(2/100) ≈ 0.15 + 0.14C. For H_2: 0.04 + C*sqrt(5/100) ≈ 0.04 + 0.22C. For H_3: 0.005 + C*sqrt(20/100) ≈ 0.005 + 0.45C. With typical constants, H_3's penalty term is so large that it overwhelms the low training error. H_2 strikes the best balance: substantially lower training error than H_1 with a modest complexity penalty. This is exactly the bias-complexity tradeoff in action — H_3 overfits, H_1 underfits, and H_2 finds the sweet spot.
Question 2 True / False
SRM is essentially equivalent to L2 regularization — both add a penalty to the training objective to prevent overfitting.
TTrue
FFalse
Answer: False
While SRM and L2 regularization both control complexity, they operate at different levels. SRM selects among discrete hypothesis classes (each with a different VC dimension) by adding a complexity penalty derived from the class's capacity. L2 regularization adds a continuous penalty on weight magnitudes within a single class. SRM is a model selection procedure that operates over the class hierarchy; L2 is an intra-class regularizer. They are related — L2 regularization can be viewed as implicitly searching over nested norm-bounded subsets — but SRM is more general and more directly connected to VC theory. SRM can incorporate any capacity measure, not just weight norms.
Question 3 True / False
If you had unlimited training data, SRM would always select the most complex hypothesis class in the hierarchy.
TTrue
FFalse
Answer: True
As the sample size n approaches infinity, the estimation error term sqrt(d_k/n) goes to zero for every finite d_k. With negligible estimation error, the SRM criterion is dominated by training error, which approximates approximation error. The most complex class (largest VC dimension) has the lowest approximation error because it can best approximate the target. With infinite data, there is no overfitting risk, so there is no reason to constrain complexity. The preference for simpler classes exists only because of finite-sample estimation error — the fundamental insight of the bias-complexity tradeoff.
Question 4 Short Answer
Explain how SRM operationalizes the bias-complexity tradeoff into an algorithm, and what theoretical guarantee it provides.
Think about your answer, then reveal below.
Model answer: SRM takes the abstract bias-complexity decomposition (total risk = approximation error + estimation error) and turns it into a concrete selection procedure. Given nested hypothesis classes H_1 ⊂ H_2 ⊂ ..., SRM evaluates each class by computing ERM training error (an estimate of approximation error) plus a complexity penalty proportional to sqrt(d_k/n) (an upper bound on estimation error). It selects the class minimizing this sum. The theoretical guarantee is that SRM achieves a total risk within a constant factor of the best possible risk achievable by the optimal class in the hierarchy, up to logarithmic factors. This means SRM adapts to the unknown complexity of the target: if the target is simple, SRM selects a simple class; if complex, it selects a complex one — all without knowing the target in advance.
SRM can be viewed as the theoretical foundation for information-theoretic model selection criteria like AIC and BIC, which also balance fit against complexity. The key advantage of SRM is that its penalties are derived from generalization theory rather than asymptotic approximations.