Optimization Algorithms: SGD, Adam, RMSprop

Graduate Depth 75 in the knowledge graph I know this Set as goal
optimization gradient adam rmsprop

Core Idea

Modern optimizers like Adam and RMSprop adapt learning rates per parameter using gradient history, improving convergence over vanilla SGD. Adam (Adaptive Moment Estimation) combines momentum and RMSprop, making it robust across diverse problems. Optimizer choice affects convergence speed and stability, though learning rate scheduling may be necessary regardless.

Explainer

You already understand that stochastic gradient descent updates parameters by stepping in the direction opposite to the gradient, scaled by a learning rate. The problem is that a single fixed learning rate rarely works well across all parameters. Some parameters may have steep, well-defined gradients and converge quickly, while others sit on flat plateaus where the gradient is tiny and progress is glacially slow. Worse, the loss landscape often has different curvatures in different directions — narrow ravines where the gradient oscillates wildly along one axis while barely moving along the perpendicular one. The family of adaptive optimizers solves this by giving each parameter its own effective learning rate, automatically tuned from gradient history.

SGD with momentum is the first step beyond vanilla SGD. Instead of using only the current gradient, it maintains a running average of past gradients (the "velocity") and uses that to update parameters. This smooths out noisy oscillations and accelerates movement through flat regions — like a ball rolling downhill that accumulates speed. Mathematically, the velocity v is updated as v ← βv + (1 − β)∇L, and then parameters are updated by θ ← θ − α·v, where β (typically 0.9) controls how much history to keep. Momentum solves the oscillation problem but still uses a single learning rate α for every parameter.

RMSprop (Root Mean Square Propagation) takes a different approach. Instead of accumulating gradient direction, it tracks the magnitude of recent gradients for each parameter using an exponential moving average of squared gradients. Parameters whose gradients have been consistently large get their learning rate reduced; parameters with small gradients get a boost. The update divides the gradient by the square root of this running average: θ ← θ − (α / √(E[g²] + ε)) · g. This per-parameter scaling means the optimizer automatically adapts to the local curvature of the loss surface — steep directions get dampened, flat directions get amplified.

Adam (Adaptive Moment Estimation) combines both ideas. It maintains two running averages: the first moment (mean of gradients, like momentum) and the second moment (mean of squared gradients, like RMSprop). It also applies bias correction to account for the fact that these running averages start at zero and are initially biased toward smaller values. The result is an optimizer that both accelerates through flat regions (momentum) and adapts step sizes per parameter (RMSprop), with the bias correction ensuring stable behavior in early training. Adam's default hyperparameters (β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸) work well across a remarkably wide range of problems, which is why it has become the default choice for training neural networks. However, Adam can sometimes generalize worse than well-tuned SGD with momentum, and variants like AdamW (which decouples weight decay from the adaptive update) address some of these shortcomings.

Practice Questions 5 questions

Prerequisite Chain

Counting to 10Counting to 20Understanding ZeroThe Number ZeroCounting to FiveOne-to-One CorrespondenceCombining Small Groups Within 5Addition Within 10Addition Within 20Two-Digit Addition Without RegroupingTwo-Digit Addition with RegroupingAddition Within 100Repeated Addition as MultiplicationMultiplication Facts Within 100Division as Equal SharingDivision as Grouping (Measurement Division)Division: Grouping (Repeated Subtraction) ModelDivision: Fair Sharing ModelDivision as Equal SharingDivision as GroupingBasic Division FactsDivision Facts Within 100Two-Digit by One-Digit DivisionDivision with RemaindersRemainders and Quotients in DivisionDivision Word ProblemsIntroduction to Long DivisionFactors and MultiplesPrime and Composite NumbersEquivalent FractionsRelating Fractions and DecimalsDecimal Place ValueReading and Writing DecimalsComparing and Ordering DecimalsAdding and Subtracting DecimalsMultiplying DecimalsDividing DecimalsDividing FractionsMixed Number ArithmeticOrder of OperationsInteger Order of OperationsVariable ExpressionsCombining Like TermsOne-Step EquationsTwo-Step EquationsSolving Multi-Step EquationsEquations with Variables on Both SidesLiteral EquationsSlope-Intercept FormPoint-Slope FormWriting Linear EquationsParallel and Perpendicular Line SlopesGraphing Linear EquationsSystems of Equations — Graphing MethodSystems of Equations — Elimination MethodSystems of Three VariablesMatrices IntroductionGraph Representation: Matrices and ListsGraph Paths, Cycles, and ConnectivityTrees and Spanning TreesBinary TreesTree TraversalsBreadth-First Search (BFS)Topological SortDynamic ProgrammingEdit Distance: Levenshtein Distance and DP0/1 Knapsack Problem: Bounded Capacity DPGreedy AlgorithmsActivity Selection Problem Using Greedy AlgorithmsDijkstra's AlgorithmA* Search AlgorithmHeuristic Search FunctionsLocal Search OptimizationGenetic AlgorithmsStochastic Gradient Descent and VariantsOptimization Algorithms: SGD, Adam, RMSprop

Longest path: 76 steps · 535 total prerequisite topics

Prerequisites (5)

Leads To (0)

No topics depend on this one yet.