5 questions to test your understanding
A team trains a deep neural network with Adam and achieves fast training convergence but poor generalization to the test set. Switching to well-tuned SGD with momentum achieves similar training loss but significantly better test accuracy. What best explains this pattern?
What specific problem does RMSprop solve that vanilla SGD with momentum does not?
True or false: Adam's bias correction step is necessary because the first and second moment estimates are initialized at zero and, without it, would underestimate the true gradient statistics early in training.
True or false: Because Adam adapts the learning rate individually for each parameter, the global learning rate hyperparameter α becomes irrelevant and does not need to be tuned.
Explain how Adam combines the properties of SGD with momentum and RMSprop, and what problem each component solves.
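For checking answers to the questions on bias correction and on how Adam is assembled, here is a minimal sketch of a single Adam step for one scalar parameter. It assumes the standard formulation of the update; the function name `adam_step`, its argument names, and the default values shown are illustrative choices, not taken from any particular library.

```python
import math

def adam_step(theta, grad, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (illustrative sketch).

    t is the 1-based step count; m and v start at zero.
    """
    # First moment: exponential moving average of gradients (the momentum-like part).
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: exponential moving average of squared gradients (the RMSprop-like part).
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction: rescale the zero-initialized averages, which matters most for small t.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Per-parameter scaled step; the global learning rate alpha still multiplies it.
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v, m_hat, v_hat

# First step with a gradient of 2.0 and both moments initialized at zero.
theta, m, v, m_hat, v_hat = adam_step(theta=1.0, grad=2.0, m=0.0, v=0.0, t=1)
print(m, m_hat)  # raw first moment is about 0.2; the corrected value is about 2.0
print(v, v_hat)  # raw second moment is about 0.004; the corrected value is about 4.0
```

On this first step the raw averages are far smaller than the gradient that produced them, and dividing by `1 - beta ** t` restores the expected scale. The size of the resulting update is still set by `alpha`.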