Questions: Stochastic Gradient Descent and Variants
5 questions to test your understanding
Question 1 Multiple Choice
A research team trains a neural network using full-batch gradient descent; another uses mini-batch SGD. Which statement best explains a potential advantage of mini-batch SGD?
A. Mini-batch SGD always converges faster in terms of total computation
B. Mini-batch SGD computes the exact gradient direction, avoiding errors that accumulate in full-batch
C. The gradient noise in mini-batch SGD can help escape shallow local minima and saddle points that would trap full-batch descent
D. Mini-batch SGD avoids the need for learning rate tuning, making it simpler to use
Full-batch gradient descent follows the exact gradient of the training loss, so nothing perturbs it once that gradient vanishes: it can stall at a saddle point or settle into a shallow local minimum. Mini-batch SGD's noisy gradient estimates act like random perturbations that can knock the iterate out of these traps. This noise is a feature, not just a bug. The other options fail: mini-batch SGD is not always cheaper in total computation (option A), it only estimates the gradient rather than computing it exactly (option B), and it still requires careful learning rate tuning (option D).
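The saddle-point argument can be illustrated with a toy sketch. Everything here is an assumption chosen for illustration: the function f(x, y) = x^2 + y^4/4 - y^2/2, the starting point, the step size, and in particular the Gaussian noise, which only stands in for real mini-batch gradient noise.

```python
import numpy as np

def grad(p):
    """Gradient of the toy loss f(x, y) = x**2 + y**4 / 4 - y**2 / 2.

    f has a saddle point at (0, 0) and minima at (0, 1) and (0, -1).
    """
    x, y = p
    return np.array([2.0 * x, y**3 - y])

def descend(noise_std, steps=500, lr=0.1, seed=0):
    """Gradient descent from (1, 0) with optional Gaussian gradient noise."""
    rng = np.random.default_rng(seed)
    p = np.array([1.0, 0.0])  # starts exactly on the line y = 0
    for _ in range(steps):
        # Injected noise is a stand-in for mini-batch sampling noise.
        noise = noise_std * rng.standard_normal(2)
        p = p - lr * (grad(p) + noise)
    return p

exact = descend(noise_std=0.0)   # exact gradient: y never leaves 0
noisy = descend(noise_std=0.01)  # noisy gradient: y drifts off the saddle
print("exact gradient ends at:", exact)
print("noisy gradient ends at:", noisy)
```

With the exact gradient, the y-component of the gradient is zero along the whole path, so the iterate slides into the saddle at (0, 0) and stays there. With even a small amount of noise, y is pushed off zero and the iterate ends up near one of the two minima.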
Question 2 Multiple Choice
Adam optimizer adapts the learning rate for each parameter based on gradient history. What is the key motivation for this per-parameter adaptation?
A. Different parameters have different units, so their gradients must be rescaled before comparison
B. Parameters with consistently large gradients risk overshooting, while parameters with small or infrequent gradients need larger effective steps to learn at all
C. Adaptive learning rates guarantee convergence to the global minimum rather than a local minimum
D. Adam eliminates the need for momentum because it subsumes momentum's function entirely
Adam addresses heterogeneous gradient magnitudes. Parameters that receive large, frequent gradient updates risk overshooting, so Adam dampens their effective learning rate. Parameters with small or rare gradients (common with sparse features) would barely move under a fixed global rate, so Adam amplifies their effective steps. This per-parameter adaptation makes Adam robust across diverse architectures. Note that Adam itself incorporates momentum (option D is false), and adaptive rates do not guarantee convergence to a global minimum (option C is false).
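As a rough sketch of the per-parameter scaling, the snippet below writes out the standard Adam update in NumPy and applies it to two parameters whose gradients differ by four orders of magnitude. The gradient values and the helper name `adam_step` are hypothetical, picked only to make the effect visible; the hyperparameter defaults are the commonly used ones.

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * g      # running mean of gradients (momentum term)
    v = beta2 * v + (1 - beta2) * g**2   # running mean of squared gradients
    m_hat = m / (1 - beta1**t)           # bias correction for the zero initialization
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)
g = np.array([100.0, 0.01])  # hypothetical constant gradients, very different scales

for t in range(1, 11):
    theta, m, v = adam_step(theta, g, m, v, t)

adam_move = -theta            # total distance moved by Adam per parameter
sgd_move = 10 * 1e-3 * g      # what plain SGD with the same rate would move
print("Adam moved:", adam_move)
print("SGD would move:", sgd_move)
```

Under these assumptions both parameters move by roughly the same amount with Adam (about `lr` per step), whereas plain SGD with one global rate would move the first parameter ten thousand times further than the second.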
Question 3 True / False
The gradient noise introduced by using small mini-batches in SGD can be beneficial, acting as an implicit regularizer and helping the optimizer find flatter, better-generalizing minima.
T. True
F. False
Answer: True
This is a well-documented property of SGD. Stochastic fluctuations make it harder for the optimizer to settle into sharp, narrow minima, so it tends to find flatter regions of the loss landscape, which often generalize better to new data. This is part of why well-tuned SGD with momentum sometimes achieves better test accuracy than Adam, even when Adam converges faster during training. The noise is not merely tolerated: it provides regularization that pure full-batch methods lack.
Question 4 True / False
Increasing the mini-batch size in SGD typically improves both training speed and final model performance.
T. True
F. False
Answer: False
Larger batch sizes reduce gradient noise, making each update more accurate, and they can exploit hardware parallelism. But beyond a certain batch size, the gradient noise that helps SGD escape shallow minima largely disappears, which often leads to convergence to sharper minima that generalize less well. In practice, there is typically a sweet spot (often 32–512) that balances gradient quality, computational efficiency, and the regularizing benefit of stochasticity.
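The link between batch size and gradient noise can be checked numerically. The sketch below assumes a synthetic linear-regression problem (the data, the dimensions, and the helper names `mse_grad` and `noise_level` are all made up for illustration) and measures how far a mini-batch gradient typically lands from the full-batch gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4096, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n)
w = np.zeros(d)  # evaluate all gradients at one fixed, arbitrary point

def mse_grad(Xb, yb, w):
    """Gradient of 0.5 * mean((Xb @ w - yb)**2) with respect to w."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

full = mse_grad(X, y, w)  # full-batch gradient used as the reference

def noise_level(batch_size, trials=300):
    """Mean distance between a random mini-batch gradient and the full gradient."""
    dists = []
    for _ in range(trials):
        idx = rng.choice(n, size=batch_size, replace=False)
        dists.append(np.linalg.norm(mse_grad(X[idx], y[idx], w) - full))
    return float(np.mean(dists))

noise = {b: noise_level(b) for b in (8, 64, 512)}
for b, s in noise.items():
    print(f"batch size {b:4d}: mean gradient error {s:.3f}")
```

On this toy problem the error shrinks roughly with the square root of the batch size, which is the sense in which very large batches remove most of the stochasticity that small batches provide.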
Question 5 Short Answer
For some tasks, well-tuned SGD with momentum achieves better final generalization than Adam. Why might this be, despite Adam's more sophisticated gradient adaptation?
Model answer: Adam's per-parameter adaptive learning rates allow fast convergence, but they can cause convergence to sharper, narrower minima that generalize less well to new data. Well-tuned SGD with momentum retains more gradient noise throughout training, which acts as implicit regularization that steers it toward flatter minima. The tradeoff is that SGD requires more careful hyperparameter tuning and may take longer to converge.
This illustrates a general principle: the fastest optimizer is not always the best optimizer. In deep learning, generalization (performance on new data) matters more than minimizing training loss. Adam's adaptive rates make it a robust, low-effort choice, but SGD's noise can act as a regularizer that steers it toward solutions that generalize better. This has been observed empirically on image classification benchmarks, where SGD with momentum still achieves strong results despite Adam's convenience.