Generalization Bounds for Deep Networks

Research Depth 69 in the knowledge graph I know this Set as goal
Unlocks 12 downstream topics
generalization deep-learning pac-bayes margin-bounds

Core Idea

Classical generalization bounds based on VC dimension or parameter counting are vacuous for modern deep networks (they predict the network should not generalize at all). Tighter bounds have been developed using different complexity measures: spectral-norm bounds control generalization through the product of layer spectral norms divided by the margin; PAC-Bayes bounds measure the KL divergence between the learned weights and a prior distribution; compression-based bounds exploit the fact that trained networks can often be compressed without loss of accuracy. While these bounds are tighter than classical ones, they remain loose by orders of magnitude in practice — closing this gap is an active research frontier.

Explainer

The generalization puzzle for deep networks is stark: classical learning theory says a model with more parameters than training examples should memorize the training data and fail on test data. Modern deep networks routinely violate this prediction — they have orders of magnitude more parameters than examples yet generalize well. The search for tight, informative generalization bounds for deep networks is one of the most active areas in theoretical ML.

VC dimension and Rademacher complexity bounds, which work beautifully for simpler model classes, give vacuous bounds for deep networks. The VC dimension of a network grows with the number of parameters, producing bounds that exceed 100% error — mathematically valid but practically useless. The problem is that these measures treat all parameter settings as equally likely, ignoring that SGD navigates to a tiny, structured region of the parameter space. Better bounds must capture this structure.

Spectral-norm margin bounds (Bartlett, Foster, Telgarsky, 2017) measure complexity through the product of layer spectral norms divided by the classification margin. The spectral norm ||W_i|| of a weight matrix is its largest singular value — a measure of how much the layer amplifies signals. The bound on Rademacher complexity scales as the product of spectral norms times the Frobenius norm of the reference matrix, divided by the margin and sqrt(n). This is parameter-count-independent: a network with many parameters but well-controlled spectral norms (through normalization, regularization, or the implicit effects of SGD) can have a tighter bound than a smaller network with large spectral norms.

PAC-Bayes bounds take a different approach: they measure the "distance" (in KL divergence) between the learned weights and a prior distribution specified before training. The bound is O(sqrt(KL(posterior || prior) / n)). If SGD finds weights close to the initialization (which over-parameterization encourages), and the prior is centered at the initialization, the KL divergence can be small even with millions of parameters. Compression bounds offer yet another perspective: if the trained network can be described in k bits (through pruning, quantization, or low-rank factorization) without losing accuracy, the generalization bound depends on k, not the original parameter count. All three approaches — spectral norms, PAC-Bayes, and compression — attempt to capture the effective complexity of the learned function rather than the raw capacity of the architecture, and all give tighter (though still imperfect) bounds than classical measures.

Practice Questions 4 questions

Prerequisite Chain

Counting to 10Counting to 20Understanding ZeroThe Number ZeroCounting to FiveOne-to-One CorrespondenceCombining Small Groups Within 5Addition Within 10Addition Within 20Two-Digit Addition Without RegroupingTwo-Digit Addition with RegroupingAddition Within 100Repeated Addition as MultiplicationMultiplication Facts Within 100Division as Equal SharingDivision as Grouping (Measurement Division)Division: Grouping (Repeated Subtraction) ModelDivision: Fair Sharing ModelDivision as Equal SharingDivision as GroupingBasic Division FactsDivision Facts Within 100Two-Digit by One-Digit DivisionDivision with RemaindersRemainders and Quotients in DivisionDivision Word ProblemsIntroduction to Long DivisionFactors and MultiplesPrime and Composite NumbersEquivalent FractionsRelating Fractions and DecimalsDecimal Place ValueReading and Writing DecimalsComparing and Ordering DecimalsAdding and Subtracting DecimalsMultiplying DecimalsDividing DecimalsDividing FractionsMixed Number ArithmeticOrder of OperationsInteger Order of OperationsVariable ExpressionsCombining Like TermsOne-Step EquationsTwo-Step EquationsSolving Multi-Step EquationsEquations with Variables on Both SidesLiteral EquationsSlope-Intercept FormPoint-Slope FormWriting Linear EquationsParallel and Perpendicular Line SlopesGraphing Linear EquationsPiecewise FunctionsStep FunctionsComposition of FunctionsInverse FunctionsRadical Functions and GraphsRational ExponentsExponential Functions and GraphsGeometric Sequences and SeriesSigma NotationExpected ValueVariance and Standard Deviation of Random VariablesBias-Variance TradeoffPAC Learning FrameworkGrowth Function and ShatteringVC DimensionRademacher ComplexityGeneralization Bounds for Deep Networks

Longest path: 70 steps · 419 total prerequisite topics

Prerequisites (4)

Leads To (3)