Lottery Ticket Hypothesis

Research · Depth 79 in the knowledge graph
network-pruning sparsity lottery-ticket overparameterization

Core Idea

The Lottery Ticket Hypothesis (LTH), proposed by Frankle and Carbin (2019), posits that dense, randomly initialized neural networks contain sparse subnetworks ("winning lottery tickets") that, when trained in isolation from their original initial weights, reach test accuracy comparable to the original dense network in a comparable number of training iterations. A winning ticket is found by training a dense network, pruning low-magnitude weights, and retraining the remaining weights from their original initial values (not from a fresh random initialization). This suggests that a dense network contains many candidate subnetworks and that initialization acts as a lottery draw: the more subnetworks a network holds, the more likely it is that one of them starts from values that make it trainable. LTH has implications for neural network structure, generalization, and optimization efficiency.

Explainer

The Lottery Ticket Hypothesis challenges how we think about neural network training and pruning. The classical view treats network training as a search for weights that minimize loss. Dense networks are often pruned after training by removing low-magnitude weights, reducing parameters and computation. The lottery ticket hypothesis reframes this: part of what a dense network does during training is identify which of its many embedded subnetworks can be trained effectively.

The experimental protocol is elegant. Start with a dense network f(x; theta_0) whose weights theta_0 are randomly initialized. Train it to convergence on a task, obtaining weights theta*. Build a binary pruning mask m (1 for weights to keep, 0 for weights to remove) by keeping the highest-magnitude entries of theta*. The candidate "winning ticket" is the subnetwork f(x; theta_0 ⊙ m), where ⊙ is element-wise multiplication. Crucially, this subnetwork is retrained from the original initial values theta_0 ⊙ m, not from a fresh random initialization; if it recovers the dense network's performance, it is a winning ticket.
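The mask-and-reset step of this protocol can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: flat lists stand in for weight tensors, the "trained" weights are random stand-ins rather than the result of real training, and the names `magnitude_mask`, `apply_mask`, and `keep_fraction` are hypothetical.

```python
import random

def magnitude_mask(trained, keep_fraction):
    """Return a 0/1 mask that keeps the largest-magnitude trained weights."""
    k = max(1, round(keep_fraction * len(trained)))
    # Rank positions by |weight|, largest first, and keep the top k.
    ranked = sorted(range(len(trained)), key=lambda i: abs(trained[i]), reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(trained))]

def apply_mask(weights, mask):
    """Element-wise product (the ⊙ in the text)."""
    return [w * m for w, m in zip(weights, mask)]

random.seed(0)
theta_0 = [random.gauss(0.0, 1.0) for _ in range(20)]  # original initialization
# Stand-in for training: a real run would obtain these by gradient descent.
theta_trained = [w + random.gauss(0.0, 1.0) for w in theta_0]

# The mask is chosen from the TRAINED weights...
mask = magnitude_mask(theta_trained, keep_fraction=0.2)
# ...but the values the ticket starts from are the ORIGINAL ones.
ticket_init = apply_mask(theta_0, mask)
```

The point of the sketch is the asymmetry: the trained weights decide which positions survive, while the retraining run starts from the initial values at those positions.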

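Why is this surprising? First, the conventional experience was that the sparse architectures uncovered by pruning are hard to train from the start: trained from a fresh random initialization, they typically reach lower accuracy than the dense network, which is why pruned networks were usually fine-tuned from their trained weights instead. LTH shows that the pruned architecture can be trained from the start, provided the surviving weights are reset to their original initial values and the pruning mask is kept. Second, retraining the same pruned topology from a different random initialization performs worse, suggesting that the combination of mask and original initial values, not the sparse structure alone, carries the useful information.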
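The comparison described here has two starting points that share one mask. The sketch below sets both up and shows a masked update that keeps pruned weights at zero during retraining. It is an illustration under assumptions: the weights, mask, and gradients are made-up placeholders, and `reinit_control` and `masked_sgd_step` are hypothetical helper names.

```python
import random

def reinit_control(mask, seed):
    """Same sparsity pattern, freshly sampled values: the control condition."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) * m for m in mask]

def masked_sgd_step(weights, grads, mask, lr=0.1):
    """One gradient step that leaves pruned positions at zero."""
    return [(w - lr * g) * m for w, g, m in zip(weights, grads, mask)]

theta_0 = [0.8, -0.1, 0.05, -1.2, 0.3]  # hypothetical original initialization
mask = [1, 0, 0, 1, 1]                  # hypothetical mask from magnitude pruning

# Candidate ticket: original values at the surviving positions.
ticket_start = [w * m for w, m in zip(theta_0, mask)]
# Control: identical topology, different random values.
control_start = reinit_control(mask, seed=123)

# Placeholder gradients; a real run would compute them from a loss.
grads = [0.5, 0.5, 0.5, 0.5, 0.5]
ticket_next = masked_sgd_step(ticket_start, grads, mask)
```

Training both `ticket_start` and `control_start` with the same masked update and comparing test accuracy is the experiment the paragraph refers to.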

LTH has several implications. It suggests that the random initialization is not merely noise: the specific initial values of a winning ticket matter for its trainability. Different random seeds yield different winning tickets, and some initializations may be "luckier" than others. It also reframes overparameterization: dense networks contain many subnetworks, and optimization selects one. This selection happens implicitly through gradient descent, which is thought to carry an implicit bias toward simple, generalizing solutions (implicit regularization). The hypothesis connects the success of overparameterized networks with the efficiency of sparse models: overparameterization helps find good sparse solutions, but the solution that is found can itself be sparse.

Practical implications are significant. If LTH holds, you can substantially reduce network size and computation by first training dense, then pruning and retraining from the original initialization. This is a train-large-then-compress strategy: you train a large model and extract a smaller, more efficient subnetwork. However, the dense training cost is not reduced, and the retraining run adds to it, so the practical savings are limited to inference and storage.

Limitations and open questions remain. LTH has been verified empirically for image classification, but results are more mixed for other domains (NLP, other architectures). The hypothesis fails at very high pruning levels (>99% of weights removed). For larger networks, resetting to the original initialization often does not produce winning tickets; follow-up work instead "rewinds" the surviving weights to their values from an early training iteration, which adds a checkpointing step and a choice of rewind point. Theoretically, explaining *why* the original initialization is special remains open. The mechanism by which sparse subnetworks with original weights can match dense network performance, and the properties of the initialization that allow this, are frontiers of research.
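Rewinding can be sketched as a reset to a saved checkpoint, with step 0 corresponding to the original initialization. The example below is a toy under stated assumptions: a quadratic loss with a made-up `target` replaces a real network, every step is checkpointed (a real run would save only a few), and `train_with_checkpoints` and `rewind` are hypothetical names.

```python
def train_with_checkpoints(theta_0, grad_fn, steps, lr=0.1):
    """Plain gradient descent that records the weights after every step."""
    history = [list(theta_0)]  # history[0] is the original initialization
    theta = list(theta_0)
    for _ in range(steps):
        grads = grad_fn(theta)
        theta = [w - lr * g for w, g in zip(theta, grads)]
        history.append(list(theta))
    return history

def rewind(history, mask, k):
    """Masked weights from step k; k = 0 resets to the original initialization."""
    return [w * m for w, m in zip(history[k], mask)]

# Toy loss 0.5 * sum((w - t)^2), whose gradient is w - t (hypothetical target).
target = [2.0, 0.0, -1.0, 0.1]

def grad_fn(theta):
    return [w - t for w, t in zip(theta, target)]

theta_0 = [0.5, -0.2, 0.3, 0.0]
history = train_with_checkpoints(theta_0, grad_fn, steps=50)

# Magnitude mask from the final weights, keeping the 2 largest entries.
final = history[-1]
ranked = sorted(range(len(final)), key=lambda i: abs(final[i]), reverse=True)
mask = [1 if i in ranked[:2] else 0 for i in range(len(final))]

init_ticket = rewind(history, mask, k=0)   # reset to initialization
early_ticket = rewind(history, mask, k=5)  # reset to an early iteration
```

The only difference between the two variants is the index `k` of the checkpoint; which value of `k` works in practice is an empirical question that depends on the network and task.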

The Lottery Ticket Hypothesis has spawned follow-up research on training dynamics, weight rewinding and finding winning tickets earlier in training, and various pruning strategies. It stands as a reminder that deep learning contains structural surprises: dense networks are not monolithic learners but collections of possible learners, and optimization selects among them through mechanisms still not fully understood.


Prerequisite Chain

Counting to 10 → Counting to 20 → Understanding Zero → The Number Zero → Counting to Five → One-to-One Correspondence → Combining Small Groups Within 5 → Addition Within 10 → Addition Within 20 → Two-Digit Addition Without Regrouping → Two-Digit Addition with Regrouping → Addition Within 100 → Repeated Addition as Multiplication → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Two-Digit by One-Digit Division → Division with Remainders → Remainders and Quotients in Division → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Reading and Writing Decimals → Comparing and Ordering Decimals → Adding and Subtracting Decimals → Multiplying Decimals → Dividing Decimals → Dividing Fractions → Mixed Number Arithmetic → Order of Operations → Integer Order of Operations → Variable Expressions → Combining Like Terms → One-Step Equations → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Literal Equations → Slope-Intercept Form → Point-Slope Form → Writing Linear Equations → Parallel and Perpendicular Line Slopes → Graphing Linear Equations → Piecewise Functions → One-Sided Limits → Continuity Definition → Limit Definition of the Derivative → Power Rule → Constant Multiple and Sum/Difference Rules → Product Rule → Chain Rule → Higher-Order Derivatives → Concavity and Inflection Points → Second Derivative Test → Curve Sketching → Optimization Problems → Critical Points of Multivariable Functions → Critical Points and Classification of Extrema → Second Partial Test for Local Extrema (Hessian) → The Hessian Matrix and Second Derivative Test → Unconstrained Optimization: Finding Extrema → Optimization in Multiple Variables → Support Vector Machines → Kernel Methods and the Kernel Trick → Kernel Theory and RKHS → Representer Theorem → Regularization Theory (Tikhonov, Spectral) → Implicit Regularization → Overparameterization Theory → Lottery Ticket Hypothesis

Longest path: 80 steps · 523 total prerequisite topics

Prerequisites (3)

Leads To (0)

No topics depend on this one yet.