Questions: Optimization Algorithms: SGD, Adam, RMSprop

5 questions to test your understanding

Question 1 Multiple Choice

A team trains a deep neural network with Adam and achieves fast training convergence but poor generalization to the test set. Switching to well-tuned SGD with momentum achieves similar training loss but significantly better test accuracy. What best explains this pattern?

A. Adam is fundamentally broken for deep learning and should be replaced in all cases
B. Adam's per-parameter adaptive learning rates can converge to sharp minima that interpolate the training data but generalize poorly; SGD with momentum may find flatter minima that generalize better
C. Adam uses too much memory, corrupting gradient estimates and causing poor generalization
D. SGD with momentum escapes local minima more easily because it lacks adaptive learning rates
Question 2 Multiple Choice

What specific problem does RMSprop solve that vanilla SGD with momentum does not?

A. Oscillation caused by the learning rate being too high in all parameter directions simultaneously
B. Different parameters having vastly different gradient magnitudes, so a single learning rate is too large for some and too small for others
C. The inability of gradient descent to escape saddle points in the loss landscape
D. The computational cost of computing full-batch gradients on large datasets
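
For reference while working through this question, here is a minimal sketch that compares one plain SGD step with one RMSprop step on two parameters whose gradients differ by four orders of magnitude. The function names, the gradient values, and the hyperparameters (lr, decay, eps) are illustrative assumptions, not taken from any particular library.

```python
import math

def sgd_step(grads, lr=0.01):
    """Plain SGD: every parameter shares the same learning rate."""
    return [-lr * g for g in grads]

def rmsprop_step(grads, sq_avg, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSprop step.

    Returns the per-parameter updates and the new running average
    of squared gradients.
    """
    new_sq_avg = [decay * s + (1 - decay) * g * g for s, g in zip(sq_avg, grads)]
    updates = [-lr * g / (math.sqrt(s) + eps) for g, s in zip(grads, new_sq_avg)]
    return updates, new_sq_avg

# Hypothetical gradients: one very large, one very small.
grads = [100.0, 0.01]

sgd_updates = sgd_step(grads)
rms_updates, sq_avg = rmsprop_step(grads, [0.0, 0.0])

# Ratio of the step taken on the first parameter to the step on the second.
sgd_ratio = abs(sgd_updates[0] / sgd_updates[1])
rms_ratio = abs(rms_updates[0] / rms_updates[1])
print("SGD step ratio:", sgd_ratio)
print("RMSprop step ratio:", rms_ratio)
```

Under these assumptions the SGD steps inherit the full spread of the gradient magnitudes, while the RMSprop steps come out at a similar scale for both parameters.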
Question 3 True / False

Adam's bias correction step is necessary because the first and second moment estimates are initialized at zero and would underestimate true gradient statistics early in training without it.

T. True
F. False
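
To check the statement numerically, the sketch below tracks a zero-initialized exponential moving average of a constant gradient, with and without the correction factor 1 / (1 - beta1^t). The function name, the constant gradient of 1.0, and beta1 = 0.9 are assumptions chosen for illustration.

```python
def ema_first_moment(grads, beta1=0.9):
    """Return (raw, bias_corrected) first-moment estimates after each step."""
    m = 0.0  # moment estimate starts at zero, as in Adam
    history = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        m_hat = m / (1 - beta1 ** t)  # bias-corrected estimate
        history.append((m, m_hat))
    return history

# Hypothetical gradient stream: the true mean gradient is exactly 1.0.
history = ema_first_moment([1.0] * 5)
for t, (raw, corrected) in enumerate(history, start=1):
    print(t, round(raw, 4), round(corrected, 4))
```

Comparing the raw and corrected columns against the true mean of 1.0 over the first few steps shows what the correction does.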
Question 4 True / False

Because Adam adapts the learning rate individually for each parameter, the global learning rate hyperparameter α becomes irrelevant and does not need to be tuned.

T. True
F. False
Question 5 Short Answer

Explain how Adam combines the properties of SGD with momentum and RMSprop, and what problem each component solves.

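
After drafting an answer, it may help to compare it against a minimal scalar sketch of the standard Adam update. The function name adam_step, the toy objective f(x) = x^2, and the hyperparameter values are assumptions made for illustration; this is not any specific library's implementation.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter.

    m: running average of gradients (the momentum-like component)
    v: running average of squared gradients (the RMSprop-like component)
    t: 1-based step count, used for bias correction
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # correct the zero-initialization bias
    v_hat = v / (1 - beta2 ** t)
    # The global learning rate lr still scales every step.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Toy problem (assumed for illustration): minimize f(x) = x^2, gradient 2x.
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2 * x
    x, m, v = adam_step(x, grad, m, v, t, lr=0.05)
print("x after 2000 steps:", x)
```

Note that lr appears as a multiplier on every update, which is also relevant to Question 4.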