4 questions to test your understanding
A network with 10^7 parameters achieves 1% training error and 5% test error. The VC-dimension-based bound predicts the test error could be as high as 100% (the bound is vacuous). Why do VC-based bounds fail here?
PAC-Bayes bounds depend on the KL divergence between the learned weight distribution (the posterior) and a prior distribution chosen before seeing the data. Why is the choice of prior critical?
Spectrally-normalized margin bounds grow with the product of the layers' spectral norms rather than with the number of parameters. Why does this allow a 1000-layer network with small per-layer spectral norms to have a tighter bound than a 2-layer network with large spectral norms?
Explain why compression-based generalization bounds are conceptually natural for deep networks and how they avoid the parameter-counting problem.
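As a rough numerical aid for the first and third questions, the sketch below evaluates a simplified VC-style bound and a product of per-layer spectral norms. It is an illustration under stated assumptions, not a faithful implementation of any published bound: the VC term drops constants and log factors and takes the VC dimension to be on the order of the parameter count, the sample size of 50,000 is hypothetical (the question does not give one), and the per-layer norms 1.0, 1.01, and 10.0 are made-up values. The function names are also illustrative.

```python
import math


def vc_style_bound(train_error, vc_dim, n_samples):
    """Simplified VC-style bound: train error + sqrt(vc_dim / n_samples).

    Constants and log factors are dropped (an assumption for illustration).
    The result is capped at 1.0 because an error rate cannot exceed 100%.
    """
    return min(1.0, train_error + math.sqrt(vc_dim / n_samples))


def spectral_norm_product(layer_norms):
    """Product of per-layer spectral norms, accumulated in log space."""
    return math.exp(sum(math.log(s) for s in layer_norms))


# Question 1: 10^7 parameters, 1% training error.
# n_samples = 50_000 is a hypothetical value, not taken from the question.
vc_bound = vc_style_bound(train_error=0.01, vc_dim=10**7, n_samples=50_000)

# Question 3: hypothetical per-layer spectral norms.
deep_unit = spectral_norm_product([1.0] * 1000)     # 1000 layers, norm 1.0 each
deep_slight = spectral_norm_product([1.01] * 1000)  # 1000 layers, norm 1.01 each
shallow_large = spectral_norm_product([10.0] * 2)   # 2 layers, norm 10.0 each

print(f"VC-style bound (capped): {vc_bound:.2f}")
print(f"1000 layers, norm 1.00:  {deep_unit:.3g}")
print(f"1000 layers, norm 1.01:  {deep_slight:.3g}")
print(f"2 layers, norm 10.0:     {shallow_large:.3g}")
```

Under these assumptions the VC-style term saturates at 100%, while the spectral product depends on the per-layer norms rather than on depth by itself. The norm-1.01 case is a reminder that "small" has to mean close to or below 1: a per-layer norm only slightly above 1 still compounds over 1000 layers into a product larger than that of the shallow network.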