Questions — Optimization Algorithms: SGD, Adam, RMSprop

Question 1 Multiple Choice

A team trains a deep neural network with Adam and achieves fast training convergence but poor generalization to the test set. Switching to well-tuned SGD with momentum achieves similar training loss but significantly better test accuracy. What best explains this pattern?

A. Adam is fundamentally broken for deep learning and should be replaced in all cases

B. Adam's per-parameter adaptive learning rates can converge to sharp minima that interpolate the training data but generalize poorly; SGD with momentum may find flatter minima that generalize better

C. Adam uses too much memory, corrupting gradient estimates and causing poor generalization

D. SGD with momentum escapes local minima more easily because it lacks adaptive learning rates

Question 2 Multiple Choice

What specific problem does RMSprop solve that vanilla SGD with momentum does not?

A. Oscillation caused by the learning rate being too high in all parameter directions simultaneously

B. Different parameters having vastly different gradient magnitudes, so a single learning rate is too large for some and too small for others

C. The inability of gradient descent to escape saddle points in the loss landscape

D. The computational cost of computing full-batch gradients on large datasets

Question 3 True / False

Adam's bias correction step is necessary because the first and second moment estimates are initialized at zero and would underestimate true gradient statistics early in training without it.

T. True

F. False

Question 4 True / False

Because Adam adapts the learning rate individually for each parameter, the global learning rate hyperparameter α becomes irrelevant and does not need to be tuned.

T. True

F. False

Question 5 Short Answer

Explain how Adam combines the properties of SGD with momentum and RMSprop, and what problem each component solves.

Think about your answer, then reveal below.
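
One way to check your answer is against a minimal sketch of a single Adam update for one scalar parameter. This is an illustration under stated assumptions, not a reference implementation: the function name adam_step is hypothetical, and the defaults (lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8) are the commonly cited values rather than anything specified in this quiz. The first moment m plays the role of momentum, smoothing noisy gradient directions, and the second moment v plays the role of RMSprop, rescaling the step by recent gradient magnitude.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter.

    Hypothetical helper for illustration. t is the 1-based step count;
    m and v are the running moment estimates, both initialized at zero.
    """
    # Momentum-style component: exponential moving average of gradients.
    m = beta1 * m + (1 - beta1) * grad
    # RMSprop-style component: exponential moving average of squared gradients.
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction: m and v start at zero, so early estimates are
    # scaled up to offset that initialization.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # The step is the smoothed gradient divided by its recent magnitude,
    # multiplied by the global learning rate lr.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# First step from zero-initialized moments, with a large and a small gradient.
p_large, m_large, v_large = adam_step(1.0, 10.0, 0.0, 0.0, t=1)
p_small, m_small, v_small = adam_step(1.0, 0.001, 0.0, 0.0, t=1)
```

In this sketch both first steps move the parameter by roughly lr, even though the two gradients differ by four orders of magnitude. That shows the per-parameter rescaling at work, and also why the global learning rate still sets the overall step size.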
