Questions: Optimization Theory for ML

4 questions to test your understanding

Score: 0 / 4
Question 1 Multiple Choice

Full-batch gradient descent converges at rate O(1/T) on smooth convex functions, while SGD converges at rate O(1/sqrt(T)). Given a fixed computational budget, which is faster for a dataset of n = 1,000,000 examples?

AGD is always faster because it converges at a better rate
BSGD, because each SGD iteration costs O(1) (one sample) while each GD iteration costs O(n) = O(10^6), so SGD can perform 10^6 more iterations in the same time — and 10^6 SGD steps at rate O(1/sqrt(T)) beats 1 GD step at rate O(1/T)
CThey are equivalent because the total computation is the same
DIt depends entirely on the learning rate schedule, not the algorithm
Question 2 True / False

SGD with a fixed learning rate does not converge to the exact optimum — it oscillates in a neighborhood of the optimum whose size depends on the learning rate.

TTrue
FFalse
Question 3 Multiple Choice

Strong convexity improves the convergence rate of both GD and SGD. What does strong convexity mean geometrically?

AThe function has no saddle points
BThe function curves upward at least as steeply as a quadratic bowl everywhere — formally, f(y) >= f(x) + gradient(x)·(y-x) + (mu/2)||y-x||^2 for some mu > 0, ensuring a unique minimum with no flat regions
CThe gradient is always nonzero except at the optimum
DThe Hessian matrix has all positive eigenvalues that sum to at least 1
Question 4 Short Answer

Explain why the O(1/sqrt(T)) convergence rate of SGD is optimal for convex stochastic optimization and cannot be improved by any first-order method.

Think about your answer, then reveal below.