Questions: Stochastic Gradient Descent and Variants

5 questions to test your understanding

Question 1 Multiple Choice

One research team trains a neural network with full-batch gradient descent; another uses mini-batch SGD. Which statement best explains a potential advantage of mini-batch SGD?

A. Mini-batch SGD always converges faster in terms of total computation
B. Mini-batch SGD computes the exact gradient direction, avoiding errors that accumulate in full-batch
C. The gradient noise in mini-batch SGD can help escape shallow local minima and saddle points that would trap full-batch descent
D. Mini-batch SGD avoids the need for learning rate tuning, making it simpler to use
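
For reference while weighing the options, here is a minimal sketch of how mini-batch gradients relate to the full-batch gradient on a toy least-squares problem. The data, the one-parameter model, the batch size of 32, and the helper name `grad_mse` are all illustrative assumptions, not part of the quiz.

```python
import random

def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the 1-D model y ~ w * x (hypothetical helper)."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

random.seed(0)
# Toy data (assumed): y = 3x plus Gaussian noise.
xs = [random.uniform(-1.0, 1.0) for _ in range(1000)]
ys = [3.0 * x + random.gauss(0.0, 0.5) for x in xs]

w = 0.0
full_grad = grad_mse(w, xs, ys)  # uses every example once

# Mini-batch gradients: each one uses a random subset of 32 examples.
batch_grads = []
for _ in range(200):
    idx = random.sample(range(len(xs)), 32)
    batch_grads.append(grad_mse(w, [xs[i] for i in idx], [ys[i] for i in idx]))

mean_batch_grad = sum(batch_grads) / len(batch_grads)
spread = max(batch_grads) - min(batch_grads)

print(f"full-batch gradient:      {full_grad:.3f}")
print(f"mean mini-batch gradient: {mean_batch_grad:.3f}")
print(f"mini-batch spread:        {spread:.3f}")
```

In this sketch the individual mini-batch gradients scatter around the full-batch value, while their average over many batches lands close to it.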
Question 2 Multiple Choice

The Adam optimizer adapts the learning rate for each parameter based on gradient history. What is the key motivation for this per-parameter adaptation?

A. Different parameters have different units, so their gradients must be rescaled before comparison
B. Parameters with consistently large gradients risk overshooting, while parameters with small or infrequent gradients need larger effective steps to learn at all
C. Adaptive learning rates guarantee convergence to the global minimum rather than a local minimum
D. Adam eliminates the need for momentum because it subsumes momentum's function entirely
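
As a reference for the mechanics the question describes, the following is a minimal sketch of one Adam update on a plain list of parameters. The function name `adam_step`, the hyperparameter values (commonly used defaults), and the two example gradients are assumptions chosen for illustration; this is not a production optimizer.

```python
import math

def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on lists of floats; returns new (params, m, v).

    t is the 1-based step count, used for bias correction.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g        # running mean of gradients
        v_i = beta2 * v_i + (1.0 - beta2) * g * g    # running mean of squared gradients
        m_hat = m_i / (1.0 - beta1 ** t)             # bias-corrected first moment
        v_hat = v_i / (1.0 - beta2 ** t)             # bias-corrected second moment
        # Each parameter's step is scaled by its own gradient history.
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

# Two parameters with very different gradient magnitudes (assumed values).
params = [1.0, 1.0]
grads = [100.0, 0.01]
m, v = [0.0, 0.0], [0.0, 0.0]

new_params, m, v = adam_step(params, grads, m, v, t=1)
steps = [old - new for old, new in zip(params, new_params)]
print(steps)
```

On this first step both parameters move by roughly the learning rate, even though their raw gradients differ by a factor of 10,000; plain SGD with the same learning rate would move them by `lr * g`, that is, by very different amounts.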
Question 3 True / False

The gradient noise introduced by using small mini-batches in SGD can be beneficial, acting as an implicit regularizer and helping the optimizer find flatter, better-generalizing minima.

T. True
F. False
Question 4 True / False

Increasing the mini-batch size in SGD typically improves both training speed and final model performance.

T. True
F. False
Question 5 Short Answer

For some tasks, well-tuned SGD with momentum achieves better final generalization than Adam. Why might this be, despite Adam's more sophisticated gradient adaptation?
