A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Lottery Ticket Hypothesis

Research · Depth 105 in the knowledge graph

731 prerequisites beneath it

Core Idea

The Lottery Ticket Hypothesis (LTH), proposed by Frankle and Carbin (2019), posits that dense, randomly initialized neural networks contain sparse subnetworks ("winning lottery tickets") that, when trained in isolation from their original initial weights, achieve test accuracy comparable to the original dense network. A winning ticket is found by training a dense network, pruning low-magnitude weights, resetting the remaining weights to their original initial values (rather than re-sampling them at random), and retraining. This suggests that dense networks are heavily redundant: random initialization acts like drawing many lottery tickets at once, and a larger network is more likely to contain a subnetwork whose initial weights happen to train well. LTH has implications for neural network structure, generalization, and optimization efficiency.

Explainer

The Lottery Ticket Hypothesis challenges how we think about neural network training and pruning. The classical view treats network training as a search for weights that minimize loss. Dense networks are often pruned after training by removing low-magnitude weights, reducing parameters and computation. The lottery ticket hypothesis reframes this: on this view, training a dense network works partly as a search over the many subnetworks embedded in it, identifying one that can be trained efficiently.

The experimental protocol is simple. Start with a dense network randomly initialized with weights w⁰. Train it to convergence on a task, obtaining weights w*. Build a binary pruning mask m that keeps the highest-magnitude entries of w* and zeroes the rest. Define the "winning ticket" as the subnetwork g(w⁰ ⊙ m), where ⊙ is element-wise multiplication. Crucially, the subnetwork is retrained starting from the original initialization w⁰ ⊙ m, not from freshly sampled random weights. In the reported experiments it then reaches accuracy comparable to the original dense network.
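The mask-and-reset step of the protocol above can be sketched in a few lines. This is a minimal illustration on flat weight lists, not the authors' implementation: the function names `magnitude_mask` and `winning_ticket_init`, the `keep_fraction` parameter, and the toy numbers are all assumptions made for this example, and the training steps themselves are omitted.

```python
def magnitude_mask(trained_weights, keep_fraction):
    """Return a 0/1 mask that keeps the largest-magnitude trained weights."""
    n = len(trained_weights)
    n_keep = max(1, round(keep_fraction * n))
    # Rank positions by |w| of the TRAINED weights, largest first.
    ranked = sorted(range(n), key=lambda i: abs(trained_weights[i]), reverse=True)
    kept = set(ranked[:n_keep])
    return [1 if i in kept else 0 for i in range(n)]


def winning_ticket_init(initial_weights, mask):
    """Apply the mask to the ORIGINAL initialization, not to the trained weights."""
    return [w0 if m else 0.0 for w0, m in zip(initial_weights, mask)]


# Toy flat weight vectors (made-up numbers, not from any experiment).
w0 = [0.10, -0.30, 0.05, 0.20]          # values at initialization
w_trained = [0.90, -0.02, -0.70, 0.01]  # values after dense training

mask = magnitude_mask(w_trained, keep_fraction=0.5)
ticket = winning_ticket_init(w0, mask)
print(mask)    # [1, 0, 1, 0]
print(ticket)  # [0.1, 0.0, 0.05, 0.0]
```

Note that the mask is chosen from the trained magnitudes while the surviving values come from the initialization. The weight at index 2 starts small (0.05) but survives because it grew large during training.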

Why is this surprising? First, the prevailing view before LTH was that a pruned architecture is hard to train from the start: the standard pipeline fine-tuned the surviving trained weights, and training the sparse network from a fresh random initialization typically reached lower accuracy. LTH shows that the sparse network can be trained from the start, provided the surviving weights are reset to their original initial values while keeping the pruning mask. Second, retraining the same pruned topology from a different random initialization performs noticeably worse, suggesting that the combination of mask and original initialization, not the topology alone, carries the useful structure.
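The second observation corresponds to a control experiment with two arms that share a mask and differ only in their starting values. The sketch below sets up those two starting points. The helper `random_reinit`, the Gaussian sampling with `scale=0.1`, and the toy numbers are assumptions for illustration, and no training is performed here.

```python
import random


def random_reinit(mask, seed, scale=0.1):
    """Sample a fresh Gaussian initialization for the kept positions only."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, scale) if m else 0.0 for m in mask]


# Toy values (made up): a mask from pruning and the original initialization.
mask = [1, 0, 1, 0]
w0 = [0.10, -0.30, 0.05, 0.20]

# Arm 1: winning-ticket start, surviving weights keep their original values.
ticket_arm = [w if m else 0.0 for w, m in zip(w0, mask)]
# Arm 2: control start, same sparsity pattern but newly sampled values.
control_arm = random_reinit(mask, seed=1)

# Both arms have identical topology; only the starting values differ.
same_pattern = [a != 0.0 for a in ticket_arm] == [c != 0.0 for c in control_arm]
print(same_pattern)
```

Training both arms with the same data and schedule, then comparing test accuracy, is what separates the effect of the initialization from the effect of the sparse topology.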

LTH has several implications. It suggests that randomness in initialization is not pure noise: which subnetworks train well depends on the particular initial values drawn. Different random seeds induce different winning tickets, and some seeds may be "luckier" than others. It also reframes overparameterization: dense networks contain many subnetworks, and optimization effectively selects one. One proposed reading is that this selection happens implicitly through gradient descent, whose bias toward simple, generalizing solutions is studied under the name implicit regularization. The hypothesis connects the success of overparameterized networks with the efficiency of sparse models: on this account, overparameterization makes good sparse solutions easier to find, even though the solution finally needed is sparse.

Practical implications are significant. If LTH holds, you can substantially reduce network size and computation by first training dense, then pruning and retraining from the original initialization. This is a train-large-then-extract workflow: you train a large model and extract a smaller, more efficient subnetwork. However, the dense training cost is not reduced, and the retraining adds to it, so the practical savings are limited to inference and storage.
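A rough cost accounting makes the last point concrete. The sketch below assumes an idealized model in which cost scales linearly with the number of nonzero weights, which real hardware does not guarantee for unstructured sparsity. The function `ticket_costs`, its parameters, and the numbers are hypothetical.

```python
def ticket_costs(n_weights, keep_fraction, train_steps):
    """Idealized costs in 'weight-steps', assuming cost is linear in nonzero weights."""
    kept = round(keep_fraction * n_weights)
    dense_training = n_weights * train_steps   # must be paid to find the mask
    ticket_training = kept * train_steps       # retraining the sparse subnetwork
    return {
        "kept_weights": kept,
        "dense_only_training": dense_training,
        "total_training": dense_training + ticket_training,
        "inference_ratio": kept / n_weights,
    }


# Made-up example: 1M weights, keep 10%, 1000 training steps per run.
costs = ticket_costs(n_weights=1_000_000, keep_fraction=0.1, train_steps=1000)
print(costs["total_training"] > costs["dense_only_training"])  # True
print(costs["inference_ratio"])                                # 0.1
```

Under these assumptions the total training cost rises above dense training alone, while the per-inference cost and the number of stored weights fall to the kept fraction.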

Limitations and open questions remain. LTH has been verified empirically mainly for image classification on relatively small networks; results are more mixed for other domains (NLP, other architectures) and at larger scale. The hypothesis breaks down at very high pruning levels (on the order of 99% or more of weights removed), where accuracy degrades. For deeper networks, resetting to the original initialization often fails unless the weights are instead "rewound" to their values from an early point in training, and choosing that point is non-trivial. Theoretically, explaining *why* the original initialization is special remains open. The mechanism by which sparse subnetworks with original weights can match dense network performance, and what properties of the initialization allow this, are frontiers of research.

The Lottery Ticket Hypothesis has spawned follow-up research on training dynamics, weight rewinding (resetting surviving weights to their values from early in training), methods for finding winning tickets earlier in training, and various pruning strategies. It stands as a reminder that deep learning contains structural surprises: dense networks are not monolithic learners but collections of possible learners, and optimization selects among them through mechanisms still not fully understood.

Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Dividing Integers → Unit Rates → Proportions → Percent Concept → Converting Between Fractions, Decimals, and Percents → Operations with Rational Numbers → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → 
Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Sine, Cosine, and Tangent Ratios → Trigonometric Ratios Review → Radian Measure → Converting Between Degrees and Radians → The Unit Circle → Graphing Sine and Cosine → Graphing Tangent and Reciprocal Trigonometric Functions → Derivatives of Trigonometric Functions → Antiderivatives → Indefinite Integrals → Basic Integration Rules → Riemann Sums → Definite Integral Definition → Probability Density Functions and Continuous Distributions → Cumulative Distribution Functions → Continuous Random Variables → Probability Density Functions → Expected Value → Linear Regression in Machine Learning → Neural Network Fundamentals → Backpropagation Algorithm → Multilayer Perceptrons (MLPs) → Activation Functions in Neural Networks → Vanishing Gradient Problem → Gradient Descent and Optimization → Gradient Boosting Machines → Support Vector Machines → Kernel Methods and the Kernel Trick → Kernel Theory and RKHS → Representer Theorem → Regularization Theory (Tikhonov, Spectral) → Implicit Regularization → Overparameterization Theory → Lottery Ticket Hypothesis

Longest path: 106 steps · 731 total prerequisite topics

Prerequisites (3)

Neural Network Approximation Theory (hard) · Regularization Theory (Tikhonov, Spectral) (hard) · Overparameterization Theory (soft)

Leads To (0)

No topics depend on this one yet.