When training a neural network with gradient descent, the loss stops decreasing and oscillates around a high value. What is the most likely cause?
A. The learning rate is too small
B. The learning rate is too large
C. The model has too few parameters
D. The loss function is non-differentiable
Answer: B
A large learning rate causes each parameter update to overshoot the minimum: the algorithm jumps across the valley and lands on the other side, then jumps back, oscillating without converging. A learning rate that is too small causes slow but steady progress, not oscillation. The fix is to reduce the learning rate (or use an adaptive optimizer such as Adam, which adjusts the step size per parameter).
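The overshooting behavior can be reproduced on a toy problem. The sketch below is only an illustration under stated assumptions: the loss f(x) = x^2, the starting point 3.0, and the three learning rates are arbitrary choices, not values from the question. Because this toy loss is perfectly symmetric, the overly large step makes the parameter bounce between the two sides of the valley while the loss stays stuck at its starting value.

```python
def loss(x):
    # Toy convex loss with its minimum at x = 0 (an assumed example).
    return x * x


def gradient_descent(lr, x0=3.0, steps=20):
    """Run plain gradient descent on loss(x) and return every visited x."""
    x = x0
    xs = [x]
    for _ in range(steps):
        grad = 2.0 * x          # derivative of x^2
        x = x - lr * grad       # standard gradient descent update
        xs.append(x)
    return xs


too_large = gradient_descent(lr=1.0)    # update becomes x -> -x: jumps across the valley
moderate = gradient_descent(lr=0.1)     # update becomes x -> 0.8 * x: converges
too_small = gradient_descent(lr=0.001)  # update becomes x -> 0.998 * x: slow but steady

for name, xs in [("too large", too_large),
                 ("moderate", moderate),
                 ("too small", too_small)]:
    print(f"{name:>9}: first x values {[round(x, 3) for x in xs[:4]]}, "
          f"final loss {loss(xs[-1]):.4f}")
```

With the large rate the parameter alternates between 3.0 and -3.0 and the loss never improves; with the small rate the loss decreases at every step, just very slowly.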
Question 2 True / False
Gradient descent on a non-convex loss function is very likely to find the global minimum if you run it for enough iterations.
T. True
F. False
Answer: False
On a non-convex surface, gradient descent follows the slope downhill and settles at whatever stationary point it reaches, typically a local minimum; it has no mechanism for climbing back out. Whether it finds the global minimum depends on the starting point and the shape of the loss landscape, not on the number of iterations. In practice, deep neural networks have extremely high-dimensional non-convex losses, yet gradient descent works well, because many of the local minima in such landscapes appear to have similar loss values, and reaching the global minimum is not necessary for good generalization.
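The dependence on the starting point can be seen on a one-dimensional example. This is a minimal sketch, assuming a hypothetical loss f(x) = x^4 - 3x^2 + x, which has a global minimum near x = -1.30 and a shallower local minimum near x = 1.13; the learning rate and step count are likewise illustrative choices.

```python
def loss(x):
    # Assumed non-convex toy loss with two minima.
    return x**4 - 3.0 * x**2 + x


def grad(x):
    # Derivative of the toy loss above.
    return 4.0 * x**3 - 6.0 * x + 1.0


def gradient_descent(x0, lr=0.01, steps=1000):
    """Plain gradient descent; returns the final parameter value."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x


x_from_left = gradient_descent(x0=-2.0)   # starts in the basin of the global minimum
x_from_right = gradient_descent(x0=2.0)   # starts in the basin of the local minimum

print(f"start -2.0 -> x = {x_from_left:.3f}, loss = {loss(x_from_left):.3f}")
print(f"start  2.0 -> x = {x_from_right:.3f}, loss = {loss(x_from_right):.3f}")
```

Both runs use the same number of iterations, but only the one started on the left reaches the lower minimum. Running the right-hand start for longer does not change where it ends up.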
Question 3 Short Answer
Vanilla gradient descent computes the gradient over the entire dataset before each update. What problem does stochastic gradient descent (SGD) address, and what tradeoff does it introduce?
Think about your answer before reading the model answer below.
Model answer: SGD uses a single example (or a small mini-batch) per update, so each step is much cheaper and the parameters can be updated many times within a single pass through the data. The tradeoff is that the gradient estimate is noisy, which makes the steps irregular, although this noise can help the optimizer escape sharp local minima.
For large datasets, computing the full gradient even once requires examining every training example, which becomes prohibitively expensive when repeated for every update. SGD approximates the true gradient cheaply, allowing many parameter updates per epoch. The variance in the gradient estimates introduces randomness that can act as implicit regularization and help avoid overfitting to specific training examples.
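The cost difference can be made concrete by comparing one pass over the data under each scheme. The sketch below rests on assumptions chosen only for illustration: a one-parameter linear model y = w * x, synthetic data generated with a true weight of 2.0, a squared-error loss, and a learning rate of 0.1. The helper names are hypothetical.

```python
import random


def make_data(n=1000, true_w=2.0, noise=0.1, seed=0):
    """Synthetic (x, y) pairs with y = true_w * x + Gaussian noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = true_w * x + rng.gauss(0.0, noise)
        data.append((x, y))
    return data


def example_grad(w, x, y):
    # Gradient of 0.5 * (w * x - y)^2 with respect to w for one example.
    return (w * x - y) * x


def full_batch_epoch(w, data, lr):
    """One pass over the data = one update using the average gradient."""
    avg_grad = sum(example_grad(w, x, y) for x, y in data) / len(data)
    return w - lr * avg_grad, 1


def sgd_epoch(w, data, lr, rng):
    """One pass over the data = one noisy update per example."""
    order = data[:]
    rng.shuffle(order)
    updates = 0
    for x, y in order:
        w = w - lr * example_grad(w, x, y)
        updates += 1
    return w, updates


data = make_data()
w_full, n_full = full_batch_epoch(0.0, data, lr=0.1)
w_sgd, n_sgd = sgd_epoch(0.0, data, lr=0.1, rng=random.Random(1))

print(f"full batch: {n_full} update,  w = {w_full:.3f}")
print(f"SGD:        {n_sgd} updates, w = {w_sgd:.3f}")
```

Both schemes look at every example exactly once, but the full-batch version moves the weight only a single step, while SGD ends the pass close to the true weight. The SGD result still jitters around that value from run to run, which is the noise described above.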