How does the information bottleneck principle explain why deep neural networks generalize despite having millions of parameters?
Think about your answer, then reveal below.
Model answer: Information bottleneck theory suggests that training proceeds in two phases, described in terms of the input X, the target Y, and a hidden representation T. In the fitting phase, I(T; Y) increases: the network learns to predict the target. In the compression phase, I(T; X) decreases: the network forgets details of the input that are irrelevant to the target. During compression the network discards noise and spurious correlations, arriving at a minimal representation that still explains the target. On this account, compression is not imposed by an explicit penalty; it arises from the network's layered structure and the dynamics of gradient descent, so it acts as implicit regularization. The learned representation is simple enough to generalize because it retains only the information needed to predict Y.
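To make the two quantities concrete, here is a minimal Python sketch on a toy problem. Everything in it is an illustrative assumption rather than part of the theory: X is uniform over 8 values, Y is the top bit of X, and the two hand-written "representations" (`t_memorize`, `t_compress`) stand in for a hidden layer T before and after compression. The helper names `mutual_information` and `ib_coordinates` are hypothetical.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Plug-in estimate of I(A; B) in bits from a list of (a, b) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    count_a = Counter(a for a, _ in pairs)
    count_b = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a, b) * log2( p(a, b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_a[a] * count_b[b]))
    return mi


# Toy data (assumed for illustration): X carries 3 bits, Y is its top bit.
xs = list(range(8))
ys = [x // 4 for x in xs]

# Two hand-picked representations T of X.
t_memorize = xs                    # keeps every detail of the input
t_compress = [x // 4 for x in xs]  # keeps only the label-relevant bit


def ib_coordinates(ts):
    """Return (I(T; X), I(T; Y)) in bits for a representation T."""
    return (
        mutual_information(list(zip(ts, xs))),
        mutual_information(list(zip(ts, ys))),
    )


print("memorize:", ib_coordinates(t_memorize))
print("compress:", ib_coordinates(t_compress))
```

Both representations carry the full 1 bit about Y, but the compressed one carries 1 bit about X instead of 3. That is the direction of travel the compression phase describes: I(T; X) falls while I(T; Y) is preserved. Estimating these quantities for a real network is much harder than this counting estimate suggests, since activations are continuous and high-dimensional.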
The information bottleneck principle offers an information-theoretic perspective on generalization: good representations are compressed representations. A high-capacity network that learns a compressed representation is expected to generalize well because it has extracted structure rather than memorized the training data. This ties implicit regularization, Occam's Razor (prefer simple explanations), and generalization into a single information-theoretic framework.
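For reference, the trade-off behind "compressed but still predictive" is usually written as an optimization over stochastic encoders. The form below is the standard information bottleneck Lagrangian; the symbol \beta (the trade-off weight) does not appear in the answer above and is introduced here only for illustration, with T a representation of the input X and Y the target.

```latex
% Information bottleneck objective: compress X while staying informative about Y.
% Assumes the Markov chain Y -> X -> T (T is computed from X alone).
\min_{p(t \mid x)} \; I(X; T) \;-\; \beta \, I(T; Y), \qquad \beta > 0

% Data processing inequality for the same chain:
% T cannot know more about Y than X does.
I(T; Y) \;\le\; I(X; Y)
```

A larger \beta favors keeping information about Y; a smaller \beta favors compression. The ideal representation sits where I(T; Y) is close to I(X; Y) while I(X; T) is as small as possible.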