A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Optimization Algorithms: SGD, Adam, RMSprop

Graduate · Depth 100 in the knowledge graph

3 topics build on this

747 prerequisites beneath it

Core Idea

Modern optimizers like Adam and RMSprop adapt learning rates per parameter using gradient history, improving convergence over vanilla SGD. Adam (Adaptive Moment Estimation) combines momentum and RMSprop, making it robust across diverse problems. Optimizer choice affects convergence speed and stability, though learning rate scheduling may be necessary regardless.

Explainer

You already understand that stochastic gradient descent updates parameters by stepping in the direction opposite to the gradient, scaled by a learning rate. The problem is that a single fixed learning rate rarely works well across all parameters. Some parameters may have steep, well-defined gradients and converge quickly, while others sit on flat plateaus where the gradient is tiny and progress is glacially slow. Worse, the loss landscape often has different curvatures in different directions — narrow ravines where the gradient oscillates wildly along one axis while barely moving along the perpendicular one. The family of adaptive optimizers solves this by giving each parameter its own effective learning rate, automatically tuned from gradient history.
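The ravine problem can be made concrete with a small sketch. Everything specific here is an illustrative assumption rather than something taken from the text: the quadratic loss, its curvatures (100 and 1), the starting point, and the learning rate of 0.019. The gradient is exact rather than stochastic so the effect is easy to see.

```python
# Toy "ravine": L(x, y) = 0.5 * (100 * x**2 + 1 * y**2).
# The curvatures are illustrative assumptions, not values from the text.
CURVATURES = (100.0, 1.0)

def grad(theta):
    """Exact gradient of the toy quadratic loss."""
    return [c * t for c, t in zip(CURVATURES, theta)]

def gradient_descent(theta, lr, steps):
    """Plain gradient descent with one fixed learning rate for all parameters."""
    path = [list(theta)]
    for _ in range(steps):
        theta = [t - lr * g for t, g in zip(theta, grad(theta))]
        path.append(list(theta))
    return path

path = gradient_descent([1.0, 1.0], lr=0.019, steps=50)
for step in (0, 1, 2, 3, 50):
    x, y = path[step]
    print(f"step {step:2d}: x = {x:+.4f}, y = {y:+.4f}")
```

Along the steep axis each step multiplies x by 1 − 0.019 · 100 = −0.9, so x flips sign on every update. Along the flat axis y shrinks by only a factor of 0.981 per step. In this toy setup, raising the learning rate past 0.02 makes x diverge, while lowering it slows y further, which is the tension a single shared learning rate creates.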

SGD with momentum is the first step beyond vanilla SGD. Instead of using only the current gradient, it maintains an exponential moving average of past gradients (the "velocity") and uses that to update parameters. This smooths out noisy oscillations and accelerates movement through flat regions, like a ball rolling downhill that accumulates speed. Mathematically, the velocity v is updated as v ← βv + (1 − β)∇L, and then parameters are updated by θ ← θ − α·v, where β (typically 0.9) controls how much history to keep. Momentum dampens oscillation but still uses a single learning rate α for every parameter.
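The momentum update above can be sketched in a few lines of plain Python. The function name `sgd_momentum_step`, the toy loss L(θ) = θ², and the learning rate of 0.1 are assumptions made for illustration. The update itself follows the formulas in the text, v ← βv + (1 − β)∇L and θ ← θ − α·v.

```python
def sgd_momentum_step(theta, velocity, grads, lr=0.01, beta=0.9):
    """One SGD-with-momentum update on lists of parameters.

    velocity is the exponential moving average of past gradients.
    """
    new_velocity = [beta * v + (1.0 - beta) * g for v, g in zip(velocity, grads)]
    new_theta = [t - lr * v for t, v in zip(theta, new_velocity)]
    return new_theta, new_velocity

# Hypothetical usage: minimize the toy loss L(theta) = theta**2 (gradient 2 * theta).
theta, velocity = [5.0], [0.0]
for _ in range(200):
    grads = [2.0 * theta[0]]
    theta, velocity = sgd_momentum_step(theta, velocity, grads, lr=0.1)
print(theta)  # close to the minimum at 0
```

The same `lr` multiplies every entry of the velocity, so this sketch still applies one learning rate to all parameters.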

RMSprop (Root Mean Square Propagation) takes a different approach. Instead of accumulating gradient direction, it tracks the magnitude of recent gradients for each parameter using an exponential moving average of squared gradients, E[g²] ← ρE[g²] + (1 − ρ)g², where the decay rate ρ is often set to 0.9. Parameters whose gradients have been consistently large get their effective learning rate reduced; parameters with small gradients get a boost. The update divides the gradient by the square root of this running average: θ ← θ − (α / √(E[g²] + ε)) · g, where ε is a small constant that prevents division by zero. This per-parameter scaling lets the optimizer adapt to the gradient scale in each direction, which often reflects the local curvature of the loss surface: steep directions get dampened and flat directions get amplified.
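A minimal sketch of the RMSprop update follows, using the formula from the text with ε inside the square root. The function name `rmsprop_step`, the decay rate `rho=0.9`, and the example gradients are assumptions chosen for illustration.

```python
import math

def rmsprop_step(theta, sq_avg, grads, lr=0.001, rho=0.9, eps=1e-8):
    """One RMSprop update on lists of parameters.

    sq_avg holds the running average of squared gradients, E[g^2],
    kept separately for each parameter.
    """
    new_sq_avg = [rho * s + (1.0 - rho) * g * g for s, g in zip(sq_avg, grads)]
    new_theta = [
        t - lr * g / math.sqrt(s + eps)
        for t, g, s in zip(theta, grads, new_sq_avg)
    ]
    return new_theta, new_sq_avg

# Two parameters whose gradients differ by a factor of 10,000 (made-up values).
theta, sq_avg = rmsprop_step([1.0, 1.0], [0.0, 0.0], grads=[100.0, 0.01], lr=0.1)
print([1.0 - t for t in theta])  # the two step sizes come out nearly equal
```

Dividing by the root of the running average cancels most of the difference in gradient scale, so both parameters take steps of similar size even though their raw gradients are far apart.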

Adam (Adaptive Moment Estimation) combines both ideas. It maintains two running averages: the first moment (mean of gradients, like momentum) and the second moment (mean of squared gradients, like RMSprop). It also applies bias correction to account for the fact that these running averages start at zero and are therefore biased toward zero during the first steps. The result is an optimizer that both accelerates through flat regions (momentum) and adapts step sizes per parameter (RMSprop), with the bias correction keeping step sizes well scaled in early training. Adam's default hyperparameters (β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸) work well across a remarkably wide range of problems, which is why it has become a common default choice for training neural networks. However, Adam can sometimes generalize worse than well-tuned SGD with momentum, and variants like AdamW (which decouples weight decay from the adaptive update) address some of these shortcomings.
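The two moments and the bias correction can be sketched as follows. This follows the commonly stated form of Adam, in which ε is added outside the square root. The function name `adam_step`, the learning rate default of 0.001, and the example gradients are assumptions for illustration; β₁, β₂, and ε use the defaults quoted above.

```python
import math

def adam_step(theta, m, v, grads, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    # First moment: running mean of gradients (momentum-like).
    new_m = [beta1 * mi + (1.0 - beta1) * g for mi, g in zip(m, grads)]
    # Second moment: running mean of squared gradients (RMSprop-like).
    new_v = [beta2 * vi + (1.0 - beta2) * g * g for vi, g in zip(v, grads)]
    # Bias correction: undo the pull toward zero caused by starting at 0.
    m_hat = [mi / (1.0 - beta1 ** t) for mi in new_m]
    v_hat = [vi / (1.0 - beta2 ** t) for vi in new_v]
    new_theta = [
        ti - lr * mh / (math.sqrt(vh) + eps)
        for ti, mh, vh in zip(theta, m_hat, v_hat)
    ]
    return new_theta, new_m, new_v

# First step from zero-initialized moments, with made-up gradients.
theta, m, v = adam_step([1.0, 1.0], [0.0, 0.0], [0.0, 0.0], grads=[100.0, -0.01], t=1)
print(theta)
```

On the first step the bias-corrected moments equal g and g², so each parameter moves by roughly `lr` in the direction opposite its gradient, whatever the gradient's magnitude. Without the correction, the raw averages would be scaled by 1 − β₁ and 1 − β₂, and the first steps would be sized differently.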

Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Introduction to Exponents → Order of Operations → Integer Order of Operations → Variable Expressions → The Distributive Property → Variables and Expressions Review → Introduction to Polynomials → Adding and Subtracting Polynomials → Multiplying Polynomials → Factorial → Permutations → Combinations → Counting Principles: Addition and Multiplication Rules → Introduction to Graph Theory → Propositional Logic Foundations → Logical Equivalences → 
Boolean Algebra → Boolean Type and Truth Values → Comparison Operators and Boolean Tests → Logical Operators and Boolean Algebra → Conditional Statements → Defining and Calling Functions → Functions: Decomposing Problems → Function Parameters and Argument Passing → Return Values → Variable Scope → Introduction to Classes → Objects and Instances → Methods and Attributes → Algorithm Design Basics → Tree Structure and Node Properties → Binary Trees → Tree Traversals → Depth-First Search (DFS) → Depth-First Search: Implementation and Applications → Topological Sort → Dynamic Programming → Longest Common Subsequence (LCS) Problem → Edit Distance: Levenshtein Distance and DP → 0/1 Knapsack Problem: Bounded Capacity DP → Greedy Algorithms → Activity Selection Problem Using Greedy Algorithms → Dijkstra's Algorithm → A* Search Algorithm → Heuristic Search Functions → Local Search Optimization → Genetic Algorithms → Stochastic Gradient Descent and Variants → Optimization Algorithms: SGD, Adam, RMSprop

Longest path: 101 steps · 747 total prerequisite topics

Prerequisites (6)

Gradient Descent and Optimization (hard)
Stochastic Gradient Descent and Variants (hard)
Loss Functions and Objective Functions (hard)
Partial Derivatives: Definition and Computation (soft)
Critical Points and Local Extrema (soft)
Genetic Algorithms (soft)

Leads To (1)

Hyperparameter Optimization (soft)