A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Stochastic Gradient Descent and Variants

Graduate · Depth 99 in the knowledge graph

7 topics build on this

745 prerequisites beneath it

optimization learning-algorithms

Core Idea

SGD updates parameters using single examples or small batches instead of the full dataset, enabling online learning and large-scale training. Mini-batch SGD balances gradient quality and efficiency. Momentum smooths the update direction by accumulating past gradients, while adaptive methods such as AdaGrad, RMSProp, and Adam adjust the effective learning rate per parameter.

Explainer

Standard gradient descent computes the gradient of the loss function over the entire training set before making a single parameter update. You know from your study of gradient descent that this gives you the exact gradient, whose negative is the steepest downhill direction on the loss surface. But when your dataset has millions of examples, computing the full gradient for every step is prohibitively expensive. Stochastic gradient descent makes a simple trade: instead of computing the exact gradient, estimate it from a single randomly sampled training example (or a small mini-batch of examples) and update immediately. Each individual estimate is noisy and may point somewhat away from the true gradient direction, but because the examples are sampled uniformly at random, the estimate is unbiased: its expected value equals the full gradient, so on average across many updates it points the right way.
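The claim that noisy estimates point the right way on average can be checked numerically. The following is a minimal sketch, assuming a linear model with squared-error loss on synthetic data; the function names, the data, and the batch size of 8 are illustrative assumptions, not part of the original topic.

```python
import numpy as np

def full_gradient(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y)**2) over all rows of X."""
    residual = X @ w - y
    return X.T @ residual / len(y)

def minibatch_gradient(w, X, y, batch_size, rng):
    """The same gradient, estimated from a random subset of rows."""
    idx = rng.choice(len(y), size=batch_size, replace=False)
    return full_gradient(w, X[idx], y[idx])

# Synthetic regression data (assumed purely for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=1000)
w = np.zeros(3)

exact = full_gradient(w, X, y)
estimates = np.array(
    [minibatch_gradient(w, X, y, 8, rng) for _ in range(5000)]
)

# One estimate can be far from the exact gradient...
single_error = np.linalg.norm(estimates[0] - exact)
# ...while the average of many estimates lands close to it.
mean_error = np.linalg.norm(estimates.mean(axis=0) - exact)
print(f"single-batch error: {single_error:.3f}")
print(f"error of the averaged estimate: {mean_error:.3f}")
```

SGD never actually averages thousands of estimates at one point; the averaging here only illustrates unbiasedness. In training, each noisy estimate is used once for an update and then discarded.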

This noise is not purely a disadvantage. The stochastic fluctuations can help SGD escape shallow local minima and saddle points where full-batch gradient descent may stall. Think of it like navigating a hilly landscape in fog: full-batch descent carefully computes the exact slope and walks precisely downhill, but it might get stuck in a small depression. SGD stumbles around more randomly, but that stumbling can bounce it out of shallow traps and toward deeper, more robust valleys. In practice, mini-batch SGD, typically with batches of 32 to 512 examples, usually strikes a good balance. The batch is large enough to smooth out the wildest noise and exploit GPU parallelism, but small enough to retain the regularizing benefit of stochasticity and allow many updates per pass through the data.
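A mini-batch training loop might look like the sketch below. It assumes a linear regression with squared-error loss on synthetic data; the function name `sgd_linear_regression`, the learning rate, and the epoch count are hypothetical choices for illustration, and the batch size of 32 is taken from the low end of the range mentioned above.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.05, batch_size=32, epochs=20, seed=0):
    """Fit weights w by mini-batch SGD on 0.5 * mean((X @ w - y)**2)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        # Reshuffle every epoch so the batches differ between passes.
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = X[idx].T @ residual / len(idx)  # mini-batch gradient
            w -= lr * grad                         # update immediately
    return w

# Synthetic data with known weights (assumed for illustration).
rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(2000, 3))
y = X @ true_w + 0.1 * rng.normal(size=2000)

w = sgd_linear_regression(X, y)
print("recovered weights:", np.round(w, 2))
```

With 2000 examples and a batch size of 32, each pass through the data yields 63 parameter updates, where full-batch descent would make only one.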

The learning rate is usually the most critical hyperparameter. Too large, and the updates overshoot, causing the loss to diverge. Too small, and convergence is painfully slow. Momentum addresses a related problem: in narrow valleys of the loss landscape, vanilla SGD oscillates back and forth across the valley while making slow progress along it. Momentum adds a velocity term, an exponentially decaying sum of past gradients, and each update moves along this velocity instead of the current gradient alone, which smooths the trajectory. It is analogous to a ball rolling downhill that builds speed in consistent directions and dampens oscillations in inconsistent ones.
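The narrow-valley effect can be sketched on a toy problem. The example below assumes a hypothetical quadratic loss with curvature 1 along the valley and 50 across it. It uses exact gradients, not stochastic ones, so that the effect of momentum is isolated from sampling noise. The learning rate 0.03 and momentum coefficient 0.9 are illustrative choices.

```python
import numpy as np

# Hypothetical narrow valley: f(w) = 0.5 * (1 * w[0]**2 + 50 * w[1]**2).
CURVATURE = np.array([1.0, 50.0])

def grad(w):
    """Exact gradient of the quadratic above."""
    return CURVATURE * w

def descend(lr, beta, steps=100):
    """Gradient descent with momentum; beta=0 recovers plain descent."""
    w = np.array([1.0, 1.0])
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad(w)  # velocity: decaying sum of past gradients
        w = w - lr * v          # step along the velocity
    return w

plain = descend(lr=0.03, beta=0.0)
with_momentum = descend(lr=0.03, beta=0.9)

# The minimum is at the origin, so the norm is the distance left to go.
print("plain descent, distance to minimum:", np.linalg.norm(plain))
print("with momentum, distance to minimum:", np.linalg.norm(with_momentum))
```

The learning rate must stay small to remain stable in the steep direction, so plain descent crawls along the shallow one. Momentum lets the consistent gradients along the valley add up, and in this setup it ends much closer to the minimum after the same number of steps.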

Adaptive methods like AdaGrad, RMSProp, and Adam take a different approach by maintaining a separate effective learning rate for each parameter, scaled by a running statistic of that parameter's past squared gradients. Parameters with consistently large gradients get smaller effective learning rates (preventing overshooting), while parameters with small or infrequent gradients get larger ones (accelerating learning in flat directions). Adam combines momentum with per-parameter rate adaptation and includes bias corrections for the early training steps. It has become a common default optimizer in deep learning because it tends to work reasonably well across a wide range of architectures and hyperparameter settings, though for some tasks well-tuned SGD with momentum still achieves better final performance, trading convenience for a slight edge in generalization.
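A single Adam update can be sketched as follows. The function name `adam_step` is hypothetical; the hyperparameter defaults (learning rate 0.001, decay rates 0.9 and 0.999, epsilon 1e-8) are the commonly cited ones, and the gradient values are made up to show the per-parameter scaling.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for weights w given gradient g at step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * g      # running mean of gradients (momentum)
    v = beta2 * v + (1 - beta2) * g**2   # running mean of squared gradients
    m_hat = m / (1 - beta1**t)           # bias correction: m and v start at 0,
    v_hat = v / (1 - beta2**t)           # so early averages are too small
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, 1.0])
m = np.zeros(2)
v = np.zeros(2)

# Two parameters whose gradients differ by four orders of magnitude.
g = np.array([100.0, 0.01])
w_new, m, v = adam_step(w, g, m, v, t=1)

step = w - w_new
print("step sizes:", step)
```

On the first step the bias-corrected averages reduce to the gradient and its square, so both parameters move by roughly the learning rate despite the very different gradient magnitudes. Plain SGD with the same learning rate would move the first parameter 10,000 times further than the second.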

Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Introduction to Exponents → Order of Operations → Integer Order of Operations → Variable Expressions → The Distributive Property → Variables and Expressions Review → Introduction to Polynomials → Adding and Subtracting Polynomials → Multiplying Polynomials → Factorial → Permutations → Combinations → Counting Principles: Addition and Multiplication Rules → Introduction to Graph Theory → Propositional Logic Foundations → Logical Equivalences → 
Boolean Algebra → Boolean Type and Truth Values → Comparison Operators and Boolean Tests → Logical Operators and Boolean Algebra → Conditional Statements → Defining and Calling Functions → Functions: Decomposing Problems → Function Parameters and Argument Passing → Return Values → Variable Scope → Introduction to Classes → Objects and Instances → Methods and Attributes → Algorithm Design Basics → Tree Structure and Node Properties → Binary Trees → Tree Traversals → Depth-First Search (DFS) → Depth-First Search: Implementation and Applications → Topological Sort → Dynamic Programming → Longest Common Subsequence (LCS) Problem → Edit Distance: Levenshtein Distance and DP → 0/1 Knapsack Problem: Bounded Capacity DP → Greedy Algorithms → Activity Selection Problem Using Greedy Algorithms → Dijkstra's Algorithm → A* Search Algorithm → Heuristic Search Functions → Local Search Optimization → Genetic Algorithms → Stochastic Gradient Descent and Variants

Longest path: 100 steps · 745 total prerequisite topics

Prerequisites (5)

Gradient Descent and Optimization (hard) · Partial Derivatives: Definition and Computation (soft) · Probability Axioms (soft) · Vanishing Gradient Problem (soft) · Genetic Algorithms (soft)

Leads To (3)

Batch Normalization (hard) · Optimization Algorithms: SGD, Adam, RMSprop (hard) · Simulated Annealing (soft)