Full-batch gradient descent converges at rate O(1/T) on smooth convex functions, while SGD converges at rate O(1/sqrt(T)). Given a fixed computational budget, which is faster for a dataset of n = 1,000,000 examples?
AGD is always faster because it converges at a better rate
BSGD, because each SGD iteration costs O(1) (one sample) while each GD iteration costs O(n) = O(10^6), so SGD can perform 10^6 more iterations in the same time — and 10^6 SGD steps at rate O(1/sqrt(T)) beats 1 GD step at rate O(1/T)
CThey are equivalent because the total computation is the same
DIt depends entirely on the learning rate schedule, not the algorithm
The key comparison is computational cost. GD uses the full gradient (n gradient computations per step), converging as O(1/T_GD). SGD uses one gradient per step, converging as O(1/sqrt(T_SGD)). With the same total computation C, GD runs T_GD = C/n steps and SGD runs T_SGD = C steps. GD error: O(n/C). SGD error: O(1/sqrt(C)). For large n (10^6), 1/sqrt(C) < n/C when C > n^2 = 10^12, which is extremely large. For moderate budgets, SGD is dramatically better because it processes orders of magnitude more updates. This is why SGD dominates practical ML training.
Question 2 True / False
SGD with a fixed learning rate does not converge to the exact optimum — it oscillates in a neighborhood of the optimum whose size depends on the learning rate.
TTrue
FFalse
Answer: True
With a fixed learning rate eta, SGD's noise (the variance of the stochastic gradient) prevents convergence to the exact optimum. The expected distance to the optimum converges to a ball of radius proportional to eta * sigma^2, where sigma^2 is the gradient noise variance. To converge to the exact optimum, the learning rate must decrease (e.g., eta_t = 1/t), but this slows convergence. The fixed-learning-rate oscillation is not entirely harmful: in deep learning, this 'implicit noise' has been argued to help generalization by preventing the model from settling into sharp minima. The Robbins-Monro conditions (sum eta_t = infinity, sum eta_t^2 < infinity) are necessary and sufficient for SGD to converge to the optimum.
Question 3 Multiple Choice
Strong convexity improves the convergence rate of both GD and SGD. What does strong convexity mean geometrically?
AThe function has no saddle points
BThe function curves upward at least as steeply as a quadratic bowl everywhere — formally, f(y) >= f(x) + gradient(x)·(y-x) + (mu/2)||y-x||^2 for some mu > 0, ensuring a unique minimum with no flat regions
CThe gradient is always nonzero except at the optimum
DThe Hessian matrix has all positive eigenvalues that sum to at least 1
Strong convexity with parameter mu means the function is lower-bounded by a quadratic that opens upward with curvature mu at every point. This prevents flat regions and plateaus where gradient descent would make tiny steps. The condition guarantees a unique minimum and ensures the gradient grows linearly with the distance to the optimum: ||gradient|| >= mu * ||x - x*||. This linear growth drives exponential convergence for GD (each step makes proportional progress) and O(1/T) convergence for SGD (improved from O(1/sqrt(T))). Common ML losses like ridge regression are strongly convex; plain logistic regression is convex but not strongly convex.
Question 4 Short Answer
Explain why the O(1/sqrt(T)) convergence rate of SGD is optimal for convex stochastic optimization and cannot be improved by any first-order method.
Think about your answer, then reveal below.
Model answer: The O(1/sqrt(T)) rate for SGD on general convex functions with stochastic gradients is a minimax lower bound: there exist convex functions and stochastic gradient oracles such that no algorithm using only stochastic gradient information can converge faster than O(1/sqrt(T)). The bottleneck is the noise in the stochastic gradient — each gradient estimate has variance sigma^2, and after T steps, the best possible estimation of the gradient direction accumulates O(sigma * sqrt(T)) total noise. The suboptimality after T steps is at least sigma/sqrt(T). This is not a limitation of SGD specifically but of the stochastic information model: with noisy gradients, you fundamentally cannot extract more than O(sqrt(T)) bits of useful information about the optimum's location in T queries. Variance reduction methods (like SVRG, SAGA) break this barrier by using full-gradient corrections, but they require periodic passes over the entire dataset.
The lower bound proof constructs an adversarial function where the stochastic gradient noise perfectly hides the direction to the optimum. It is an information-theoretic argument: the noise limits how much the learner can learn per step, regardless of computational power.