A neural network has 1 million parameters and is trained on 10,000 examples. Classical learning theory predicts severe overfitting. Under what conditions might the network still generalize well?
Think about your answer, then reveal below.
Model answer: Generalization is possible if implicit regularization from the optimization algorithm (e.g., SGD, or gradient descent with a small learning rate and small initialization) steers training toward solutions with good generalization properties (small norms, large margins, simple structure). Additionally, the network's architecture (e.g., convolutional structure, which builds in locality and translation equivariance) encodes inductive biases that favor structured, compositional functions. Overparameterization provides enough capacity to memorize the training set, but among the many parameter settings that fit the data, the optimization trajectory is biased away from pure memorization and toward solutions that generalize. The combination of overparameterization, implicit regularization, and inductive bias explains how such a network can generalize even without explicit regularization such as weight decay.
This reflects the modern understanding: overparameterization and regularization work together, not against each other. The large parameter count is an asset rather than a liability, providing flexibility that, combined with the implicit bias of the optimizer and the inductive bias of the architecture, enables learning of simple, generalizing solutions.
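The claim that the optimizer itself selects a "simple" solution can be illustrated on a much smaller problem than the one in the question. The sketch below is not a neural network: it assumes an overparameterized linear least-squares model (200 parameters, 20 examples, both sizes chosen only for illustration) and plain gradient descent started from zero. In that setting, gradient descent is known to converge to the minimum-norm interpolating solution, even though no norm penalty appears in the loss. The function name `gd_least_squares` and all data are hypothetical.

```python
import numpy as np

def gd_least_squares(X, y, lr, steps):
    """Plain gradient descent on 0.5 * ||Xw - y||^2, started from w = 0."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y)  # gradient of the unregularized loss
    return w

rng = np.random.default_rng(0)
n, d = 20, 200                       # assumed sizes: far more parameters than examples
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Step size 1/L, where L is the largest eigenvalue of X^T X.
lr = 1.0 / np.linalg.norm(X, 2) ** 2
w_gd = gd_least_squares(X, y, lr, steps=5000)

# Minimum-norm interpolating solution, computed directly via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

# A different interpolating solution: add a component from the null space of X.
v = rng.normal(size=d)
v_null = v - np.linalg.pinv(X) @ (X @ v)
w_other = w_min_norm + v_null

print("train residual (GD):     ", np.linalg.norm(X @ w_gd - y))
print("train residual (other):  ", np.linalg.norm(X @ w_other - y))
print("distance GD to min-norm: ", np.linalg.norm(w_gd - w_min_norm))
print("norm of GD solution:     ", np.linalg.norm(w_gd))
print("norm of other solution:  ", np.linalg.norm(w_other))
```

Both `w_gd` and `w_other` fit the training data exactly, so the loss alone cannot distinguish them; the preference for the smaller-norm one comes entirely from the optimizer and its initialization. For deep networks the corresponding implicit bias is less well characterized, so this should be read as an analogy rather than a proof.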