Classical generalization bounds based on VC dimension or parameter counting are vacuous for modern deep networks (they predict the network should not generalize at all). Tighter bounds have been developed using different complexity measures: spectral-norm bounds control generalization through the product of layer spectral norms divided by the margin; PAC-Bayes bounds measure the KL divergence between the learned weights and a prior distribution; compression-based bounds exploit the fact that trained networks can often be compressed without loss of accuracy. While these bounds are tighter than classical ones, they remain loose by orders of magnitude in practice — closing this gap is an active research frontier.
The generalization puzzle for deep networks is stark: classical learning theory says a model with more parameters than training examples should memorize the training data and fail on test data. Modern deep networks routinely violate this prediction — they have orders of magnitude more parameters than examples yet generalize well. The search for tight, informative generalization bounds for deep networks is one of the most active areas in theoretical ML.
VC dimension and Rademacher complexity bounds, which work beautifully for simpler model classes, give vacuous bounds for deep networks. The VC dimension of a network grows with the number of parameters, producing bounds that exceed 100% error — mathematically valid but practically useless. The problem is that these measures treat all parameter settings as equally likely, ignoring that SGD navigates to a tiny, structured region of the parameter space. Better bounds must capture this structure.
Spectral-norm margin bounds (Bartlett, Foster, Telgarsky, 2017) measure complexity through the product of layer spectral norms divided by the classification margin. The spectral norm ||W_i|| of a weight matrix is its largest singular value — a measure of how much the layer amplifies signals. The bound on Rademacher complexity scales as the product of spectral norms times the Frobenius norm of the reference matrix, divided by the margin and sqrt(n). This is parameter-count-independent: a network with many parameters but well-controlled spectral norms (through normalization, regularization, or the implicit effects of SGD) can have a tighter bound than a smaller network with large spectral norms.
PAC-Bayes bounds take a different approach: they measure the "distance" (in KL divergence) between the learned weights and a prior distribution specified before training. The bound is O(sqrt(KL(posterior || prior) / n)). If SGD finds weights close to the initialization (which over-parameterization encourages), and the prior is centered at the initialization, the KL divergence can be small even with millions of parameters. Compression bounds offer yet another perspective: if the trained network can be described in k bits (through pruning, quantization, or low-rank factorization) without losing accuracy, the generalization bound depends on k, not the original parameter count. All three approaches — spectral norms, PAC-Bayes, and compression — attempt to capture the effective complexity of the learned function rather than the raw capacity of the architecture, and all give tighter (though still imperfect) bounds than classical measures.