A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Gradient Descent and Optimization

Graduate · Depth 96 in the knowledge graph

66 topics build on this

673 prerequisites beneath it

Core Idea

Gradient descent iteratively moves toward minima by stepping in the negative gradient direction. Step size (learning rate) controls convergence: too small is slow, too large diverges. Momentum and adaptive methods improve convergence.

How It's Best Learned

Implement vanilla gradient descent on a convex function, visualizing iterations and comparing with Adam.

Common Misconceptions

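Gradient descent is guaranteed to reach a global minimum only for convex functions (and a suitably small learning rate); on non-convex problems it may converge to a local minimum or stall near a saddle point. Smaller learning rates are not always better: they slow convergence and can leave the optimizer crawling through flat regions.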

Explainer

The core idea of gradient descent is simple: if you know the slope of a function at your current location, you can decrease the function's value by taking a small step in the downhill direction. To minimize a loss L over a scalar parameter θ, the update is θ ← θ − η · (∂L/∂θ), where η (eta) is the learning rate. In higher dimensions, the gradient ∇L is a vector pointing in the direction of steepest ascent, so the negative gradient points downhill, and the update becomes θ ← θ − η · ∇L(θ).
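
The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the test function f(x, y) = (x − 3)² + 2(y + 1)², the helper names `grad` and `gradient_descent`, and the values η = 0.1 and 100 steps are all assumptions chosen for the example.

```python
def grad(theta):
    """Gradient of the assumed test function f(x, y) = (x - 3)^2 + 2 * (y + 1)^2."""
    x, y = theta
    return [2.0 * (x - 3.0), 4.0 * (y + 1.0)]


def gradient_descent(grad_fn, theta0, eta=0.1, steps=100):
    """Repeat theta <- theta - eta * grad(theta) for a fixed number of steps."""
    theta = list(theta0)
    for _ in range(steps):
        g = grad_fn(theta)
        # Step against the gradient, scaled by the learning rate.
        theta = [t - eta * gi for t, gi in zip(theta, g)]
    return theta


theta_min = gradient_descent(grad, [0.0, 0.0])
print(theta_min)  # approaches the minimizer (3, -1) of this convex bowl
```

Because this function is convex and the learning rate is small enough, the iterates contract toward the unique minimizer from any starting point.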

The learning rate η controls how far you step each iteration. Setting it too small means you make tiny, cautious moves and convergence takes an enormous number of steps, which is particularly painful in high-dimensional spaces with flat regions. Setting it too large causes you to overshoot: instead of landing near the minimum at the bottom of a valley, you jump past it, climb the far wall, and bounce back. If each bounce lands higher than the last, the iterates diverge. The learning rate is often the most critical hyperparameter to tune. Common practice is to start with a moderate value (e.g., 0.01) and use a learning rate schedule that decays it over time, or to use an adaptive optimizer.
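
A one-dimensional example makes the three regimes concrete. For the assumed loss L(θ) = θ², the gradient is 2θ, so each update multiplies θ by (1 − 2η): the iterates shrink toward zero only when 0 < η < 1, and the shrinkage is very slow when η is tiny. The sketch below uses a hypothetical helper `run` and three illustrative learning rates.

```python
def run(eta, steps=50, theta=1.0):
    """Gradient descent on L(theta) = theta^2, whose gradient is 2 * theta."""
    for _ in range(steps):
        theta -= eta * 2.0 * theta
    return theta


too_small = run(0.001)  # factor 0.998 per step: barely moves in 50 steps
reasonable = run(0.1)   # factor 0.8 per step: close to the minimum at 0
too_large = run(1.1)    # factor -1.2 per step: overshoots and grows

print(too_small, reasonable, too_large)
```

The sign flip in the last case is the "bouncing" described above: every step lands on the opposite side of the minimum, farther away than before.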

Vanilla (full-batch) gradient descent computes the gradient using the entire dataset before each update. For modern deep learning with millions of training examples, this is prohibitively expensive. Stochastic gradient descent (SGD) instead estimates the gradient from a single randomly chosen example, and mini-batch SGD from a small random batch (typically about 32 to 256 examples), making updates far more frequently with noisier estimates. Mini-batch SGD combines the best of both: the noise can help the optimizer escape sharp local minima, the averaging reduces variance enough to take stable steps, and modern hardware (GPUs) parallelizes mini-batch computation efficiently.
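
As a rough sketch of mini-batch SGD, the example below fits a line y ≈ w·x + b to synthetic data by minimizing mean squared error. Everything specific here is an assumption made for illustration: the helpers `make_data` and `minibatch_sgd`, the true parameters (w = 2, b = 1), the noise level, and the batch size of 32.

```python
import random


def make_data(n=200, seed=0):
    """Synthetic points from y = 2x + 1 plus small Gaussian noise (assumed setup)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)) for x in xs]


def minibatch_sgd(data, eta=0.1, batch_size=32, epochs=50, seed=0):
    """Fit y = w * x + b by mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # a fresh random order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the batch-averaged squared error w.r.t. w and b.
            gw = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
            gb = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= eta * gw
            b -= eta * gb
    return w, b


w, b = minibatch_sgd(make_data())
print(w, b)  # should land near the true values 2 and 1
```

Setting `batch_size=len(data)` in this sketch recovers full-batch gradient descent, and `batch_size=1` gives single-example SGD, so the same loop covers all three variants.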

For non-convex loss surfaces, which covers essentially every interesting problem in deep learning, gradient descent has no guarantee of finding the global minimum. It follows the local slope and stalls wherever the gradient vanishes. Empirically, this turns out to be far less of a problem than it sounds: in high-dimensional parameter spaces, most local minima that training reaches appear to have loss values close to the global minimum, and most other critical points are saddle points, which have at least one downhill direction and which noisy gradient steps tend to escape. The geometry of high-dimensional loss landscapes is fundamentally different from low-dimensional intuition.
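
The dependence on the starting point is easy to see in one dimension, where low-dimensional intuition does apply. The sketch below uses an assumed non-convex function f(x) = x⁴ − 3x² + x, which has a shallow local minimum near x ≈ 1.13 and a deeper global minimum near x ≈ −1.30; the helper `descend` and the learning rate are illustrative choices.

```python
def f(x):
    """Assumed non-convex test function with two separate minima."""
    return x ** 4 - 3.0 * x ** 2 + x


def df(x):
    """Derivative of f."""
    return 4.0 * x ** 3 - 6.0 * x + 1.0


def descend(x, eta=0.01, steps=500):
    """Plain gradient descent from the starting point x."""
    for _ in range(steps):
        x -= eta * df(x)
    return x


x_right = descend(2.0)   # starts right of both minima, settles in the local one
x_left = descend(-2.0)   # starts on the left, settles in the global one

print(x_right, f(x_right))
print(x_left, f(x_left))
```

Both runs stop at a point where the derivative is essentially zero, but only the run started on the left reaches the lower of the two loss values.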

More advanced optimizers build on the gradient descent idea by incorporating momentum and adaptive learning rates. Momentum accumulates an exponentially decaying running average of past gradients, effectively giving the optimizer "inertia" that smooths oscillations and accelerates progress in consistent directions. Adam (adaptive moment estimation) maintains per-parameter estimates of both the first moment (the mean gradient) and the second moment (the mean squared gradient, or uncentered variance), using them to normalize the step size for each parameter independently. These methods generally converge faster and more robustly than vanilla SGD on deep learning problems, though SGD with momentum remains competitive in well-tuned training regimes.
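
Both update rules can be sketched in plain Python. The momentum version below uses the running-average form v ← βv + (1 − β)g (other formulations scale the terms differently), and the Adam version follows the commonly published update with bias correction. The elongated test function f(x, y) = ½(x² + 25y²), the helper names, and the learning rate of 0.05 are assumptions for illustration only.

```python
import math


def grad(theta):
    """Gradient of the assumed test function f(x, y) = 0.5 * (x^2 + 25 * y^2)."""
    x, y = theta
    return [x, 25.0 * y]


def momentum_descent(grad_fn, theta0, eta=0.05, beta=0.9, steps=1000):
    """Gradient descent on a running average of past gradients."""
    theta = list(theta0)
    velocity = [0.0] * len(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        velocity = [beta * v + (1.0 - beta) * gi for v, gi in zip(velocity, g)]
        theta = [t - eta * v for t, v in zip(theta, velocity)]
    return theta


def adam(grad_fn, theta0, eta=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Adam: per-parameter steps scaled by first and second moment estimates."""
    theta = list(theta0)
    m = [0.0] * len(theta)  # first moment: running mean of gradients
    v = [0.0] * len(theta)  # second moment: running mean of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # Bias correction offsets the zero initialization of m and v.
        m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
        v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
        theta = [ti - eta * mh / (math.sqrt(vh) + eps)
                 for ti, mh, vh in zip(theta, m_hat, v_hat)]
    return theta


theta_momentum = momentum_descent(grad, [5.0, 1.0])
theta_adam = adam(grad, [5.0, 1.0])
print(theta_momentum, theta_adam)  # both should end close to the minimum at (0, 0)
```

Note how Adam divides each coordinate's step by the square root of its own second-moment estimate, so the steep y direction and the shallow x direction initially move at a similar pace despite a 25-fold difference in gradient scale.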

Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Dividing Integers → Unit Rates → Proportions → Percent Concept → Converting Between Fractions, Decimals, and Percents → Operations with Rational Numbers → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → 
Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Sine, Cosine, and Tangent Ratios → Trigonometric Ratios Review → Radian Measure → Converting Between Degrees and Radians → The Unit Circle → Graphing Sine and Cosine → Graphing Tangent and Reciprocal Trigonometric Functions → Derivatives of Trigonometric Functions → Antiderivatives → Indefinite Integrals → Basic Integration Rules → Riemann Sums → Definite Integral Definition → Probability Density Functions and Continuous Distributions → Cumulative Distribution Functions → Continuous Random Variables → Probability Density Functions → Expected Value → Linear Regression in Machine Learning → Neural Network Fundamentals → Backpropagation Algorithm → Multilayer Perceptrons (MLPs) → Activation Functions in Neural Networks → Vanishing Gradient Problem → Gradient Descent and Optimization

Longest path: 97 steps · 673 total prerequisite topics

Prerequisites (6)

Partial Derivatives: Definition and Computation (hard) · Critical Points and Local Extrema (hard) · Limits and Continuity in Multiple Variables (hard) · Directional Derivatives and the Gradient (hard) · Derivatives of Exponential Functions (soft) · Vanishing Gradient Problem (soft)

Leads To (11)

Artificial Potential Field Methods (soft) · Convex Optimization Fundamentals (hard) · Fine-Tuning Pretrained Models (soft) · Gradient Boosting Machines (hard) · Loss Functions and Objective Functions (hard) · Online Learning and Regret Bounds (soft) · Optimization Algorithms: SGD, Adam, RMSprop (hard) · Optimization Theory for ML (hard) · Policy Gradient Methods (hard) · Stochastic Gradient Descent and Variants (hard) · Transfer Learning in Neural Networks (soft)