In the Lottery Ticket Hypothesis, what is a 'winning ticket'?
A. A dense network that achieves 100% training accuracy
B. A sparse subnetwork that, when trained in isolation from its original initialization, reaches test accuracy comparable to the original dense network
C. A subset of training data that, if used alone, allows perfect generalization
D. A random initialization that guarantees fast convergence
A winning ticket is a sparse subnetwork of the dense network whose surviving weights keep the values they had at the dense network's original initialization. The critical point is that the subnetwork is trained from those original initial values, not from a fresh random reinitialization. Training the same subnetwork topology from a new random initialization typically reaches noticeably lower accuracy, which shows that the original initialization is crucial. This distinguishes lottery ticket pruning from other pruning methods and suggests that the pairing of a sparse mask with the specific initial weight values, not the mask alone, is what makes the subnetwork trainable.
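The procedure described above can be sketched in a few lines. This is a minimal illustration, not the method of any particular paper or library: it assumes one-shot magnitude pruning over a flat list of weights, and the names `magnitude_mask`, `apply_mask`, `theta_0`, and `theta_T` are hypothetical. The "trained" weights are simulated with noise because no real training loop is included.

```python
import random

def magnitude_mask(trained, sparsity):
    """Return a 0/1 mask that prunes the smallest-magnitude trained weights."""
    n_prune = int(len(trained) * sparsity)
    # Indices ordered by |weight|, smallest first; the first n_prune are pruned.
    order = sorted(range(len(trained)), key=lambda i: abs(trained[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(trained))]

def apply_mask(weights, mask):
    """Zero out pruned positions, keep the rest unchanged."""
    return [w * m for w, m in zip(weights, mask)]

rng = random.Random(0)

# Original initialization of the dense network (hypothetical toy example).
theta_0 = [rng.gauss(0.0, 1.0) for _ in range(10)]

# Stand-in for the weights after dense training (assumption: no real training here).
theta_T = [w + rng.gauss(0.0, 0.5) for w in theta_0]

# Step 1: derive the mask from the TRAINED weights.
mask = magnitude_mask(theta_T, sparsity=0.8)

# Step 2: the winning-ticket candidate resets survivors to their ORIGINAL values.
ticket_init = apply_mask(theta_0, mask)

# Control: same mask, fresh random values (the variant that typically does worse).
reinit_control = apply_mask([rng.gauss(0.0, 1.0) for _ in range(10)], mask)
```

Both `ticket_init` and `reinit_control` share the same sparse topology; they differ only in the values placed at the surviving positions, which is exactly the comparison the explanation relies on.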
Question 2 Short Answer
The Lottery Ticket Hypothesis claims that pruning weights AFTER training, resetting the surviving weights to the SAME initial values they started from, and retraining recovers the dense network's performance. Why is training from the original initialization important?
Think about your answer, then reveal below.
Model answer: If you retrain a pruned subnetwork from a fresh random initialization, it often performs poorly. Training from the original initialization works because the pruning mask was derived from a training run that started at exactly those values, so the surviving connections and their initial weights are already compatible with each other. This suggests that a particular random draw, although sampled randomly, is not uninformative: it contains structure that biases optimization toward certain solutions. One way to picture it is that the dense network, starting from a fixed initialization, contains several candidate subnetworks (potential winning tickets), and training develops one of them. Retraining from the same initialization rewinds to the fork in the road where the original dense training made its choice.
This touches on a deeper question: what makes some initializations 'winners' and others 'losers'? The original initialization appears to encode information that guides optimization toward good solutions. The finding is notable because it suggests that the specific initial values are not interchangeable with any other random draw: together with the mask, they shape which solutions optimization can reach.
Question 3 Multiple Choice
How does the Lottery Ticket Hypothesis relate to overparameterization and generalization?
A. LTH proves that overparameterization is harmful and causes overfitting
B. LTH suggests that overparameterization provides redundancy; the network contains multiple generalizing solutions, and optimization selects one via implicit regularization
C. LTH has no connection to overparameterization; it is purely about network pruning
D. LTH shows that sparse networks always generalize better than dense networks
LTH offers a perspective on why overparameterized networks generalize: they contain embedded redundancy. A dense network can be thought of as containing many possible sparse subnetworks, some of which are capable of solving the task. On this view, the optimization process (gradient descent on the dense network) develops one of these subnetworks: its weights grow large in magnitude while many others stay near zero, which is why magnitude pruning can recover it afterward. The selected subnetwork generalizes because it inherits the implicit regularization of dense training. This gives one possible account of why overparameterized networks do not simply overfit: overparameterization provides flexibility that, combined with the right optimization algorithm, allows finding simple (sparse) solutions.
Question 4 True / False
True or False: You can take a pruned lottery ticket subnetwork, randomly shuffle its initial weights, retrain from the shuffled initialization, and still match the original dense network's performance.
T. True
F. False
Answer: False
The Lottery Ticket Hypothesis specifically requires training from the original initialization, with each surviving weight starting at its own original value. Random shuffling keeps the same set of values but assigns them to the wrong connections, which destroys the information encoded in the original initialization. This highlights that LTH is not just about pruning to remove redundancy: it is about preserving the specific initialization that enables efficient learning of the winning subnetwork. This is a key distinction and suggests that optimization and initialization are deeply entangled.
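The shuffling control in this question can be made concrete with a small sketch. It assumes one reading of "shuffle", namely permuting the surviving initial values among the unpruned positions while leaving the mask untouched; the question does not specify the exact variant. The function name `shuffle_surviving` and the toy values are hypothetical.

```python
import random

def shuffle_surviving(init_weights, mask, seed=0):
    """Permute the surviving initial weights among the unpruned positions.

    The sparse topology (mask) is unchanged; only the assignment of
    initial values to connections is scrambled.
    """
    rng = random.Random(seed)
    kept = [i for i, m in enumerate(mask) if m]
    values = [init_weights[i] for i in kept]
    rng.shuffle(values)
    shuffled = [0.0] * len(init_weights)
    for i, v in zip(kept, values):
        shuffled[i] = v
    return shuffled

# Toy example: five weights, three of which survive pruning.
init = [0.5, -1.2, 0.3, 2.0, -0.7]
mask = [1, 0, 1, 1, 0]
shuffled = shuffle_surviving(init, mask, seed=0)
```

The shuffled initialization has the same mask and the same distribution of values as the winning ticket, so any drop in accuracy after retraining would point to the per-connection assignment of values, not to the sparsity pattern or the weight statistics.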