Questions — Stochastic Gradient Descent and Variants

Question 1 Multiple Choice

A research team trains a neural network using full-batch gradient descent; another uses mini-batch SGD. Which statement best explains a potential advantage of mini-batch SGD?

A. Mini-batch SGD always converges faster in terms of total computation

B. Mini-batch SGD computes the exact gradient direction, avoiding errors that accumulate in full-batch gradient descent

C. The gradient noise in mini-batch SGD can help escape shallow local minima and saddle points that would trap full-batch gradient descent

D. Mini-batch SGD avoids the need for learning rate tuning, making it simpler to use
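
For reference, the relationship between a mini-batch gradient and the full-batch gradient can be sketched numerically. The snippet below is a minimal illustration, not part of the original question: it assumes a synthetic linear regression problem with a mean squared error loss, and the function names (`full_gradient`, `minibatch_gradient`, `gradient_noise`) are hypothetical. It measures how far a mini-batch gradient typically lies from the full-batch gradient for a given batch size.

```python
import numpy as np

def full_gradient(w, X, y):
    """Gradient of the loss 0.5 * mean((X @ w - y)**2) over the rows of X."""
    return X.T @ (X @ w - y) / len(y)

def minibatch_gradient(w, X, y, batch_size, rng):
    """Same gradient, computed on a random subset of the data."""
    idx = rng.choice(len(y), size=batch_size, replace=False)
    return full_gradient(w, X[idx], y[idx])

def gradient_noise(batch_size, n_trials=200, seed=0):
    """Average distance between mini-batch and full-batch gradients.

    The dataset (1024 points, 5 features) is synthetic and chosen
    only for illustration.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(1024, 5))
    y = X @ np.ones(5) + rng.normal(scale=0.5, size=1024)
    w = np.zeros(5)
    g_full = full_gradient(w, X, y)
    errors = [
        np.linalg.norm(minibatch_gradient(w, X, y, batch_size, rng) - g_full)
        for _ in range(n_trials)
    ]
    return float(np.mean(errors))

if __name__ == "__main__":
    for b in (8, 128, 1024):
        print(b, gradient_noise(b))
```

Under these assumptions the measured deviation shrinks as the batch size grows and essentially vanishes when the batch covers the whole dataset, so the mini-batch gradient is a noisy estimate rather than an exact direction.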

Question 2 Multiple Choice

The Adam optimizer adapts the learning rate for each parameter based on gradient history. What is the key motivation for this per-parameter adaptation?

A. Different parameters have different units, so their gradients must be rescaled before comparison

B. Parameters with consistently large gradients risk overshooting, while parameters with small or infrequent gradients need larger effective steps to learn at all

C. Adaptive learning rates guarantee convergence to the global minimum rather than a local minimum

D. Adam eliminates the need for momentum because it subsumes momentum's function entirely
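
As a reminder of what "per-parameter adaptation" means mechanically, here is a minimal sketch of a single Adam update in NumPy. It follows the standard form of the algorithm with commonly used default hyperparameters; the function name `adam_step` and the example gradient values are assumptions made for illustration.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters w given gradient g.

    m, v are running averages of the gradient and squared gradient;
    t is the 1-based step count used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1 - beta2) * g**2       # second-moment estimate
    m_hat = m / (1 - beta1**t)               # bias-corrected moments
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

if __name__ == "__main__":
    w = np.zeros(2)
    g = np.array([100.0, 0.01])  # one large and one small gradient (illustrative)
    w_new, m, v = adam_step(w, g, np.zeros(2), np.zeros(2), t=1)
    print(w - w_new)  # per-parameter step sizes
```

In this sketch, the division by the square root of the second-moment estimate rescales each coordinate separately, so the two parameters take steps of similar size even though their raw gradients differ by four orders of magnitude.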

Question 3 True / False

The gradient noise introduced by using small mini-batches in SGD can be beneficial, acting as an implicit regularizer and helping the optimizer find flatter, better-generalizing minima.

T. True

F. False

Question 4 True / False

Increasing the mini-batch size in SGD typically improves both training speed and final model performance.

T. True

F. False

Question 5 Short Answer

For some tasks, well-tuned SGD with momentum achieves better final generalization than Adam. Why might this be, despite Adam's more sophisticated gradient adaptation?
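
For comparison when thinking about this question, a minimal sketch of SGD with momentum is given below. It uses one common convention (velocity accumulates raw gradients, and the learning rate is applied when updating the weights); other formulations exist, and the function name `sgd_momentum_step` and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

def sgd_momentum_step(w, g, velocity, lr=0.1, momentum=0.9):
    """One SGD-with-momentum update.

    Unlike an adaptive method, every parameter shares the same
    learning rate; the step is proportional to the accumulated gradient.
    """
    velocity = momentum * velocity + g
    w = w - lr * velocity
    return w, velocity

if __name__ == "__main__":
    w, vel = np.zeros(1), np.zeros(1)
    for _ in range(2):
        # constant gradient of 1.0, purely for illustration
        w, vel = sgd_momentum_step(w, np.ones(1), vel)
    print(w, vel)
```

Note that the step here scales directly with the gradient magnitude, with no per-parameter normalization.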

