A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Vanishing Gradient Problem

Graduate level, depth 95 in the knowledge graph

67 topics build on this

645 prerequisites beneath it

Core Idea

During backpropagation, the gradient reaching a layer is a product of per-layer factors. With saturating activation functions like sigmoid, those factors are small, so the product shrinks toward zero and the layers closest to the input learn very slowly (vanishing gradients). When the factors are consistently larger than 1, the product instead grows uncontrollably (exploding gradients). Solutions include careful weight initialization (Xavier, He initialization), gradient clipping, non-saturating activations (ReLU), and architectural innovations like skip connections and gating mechanisms.

How It's Best Learned

Train deep networks with sigmoid activations and observe layer-wise gradient magnitudes, then compare with ReLU networks to see how activation choice affects gradient flow.
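The experiment described above can be sketched without any deep learning framework. The following is a minimal illustration in plain Python, not a reference implementation: the depth (10), width (16), random seed, the toy loss (the sum of the final outputs), and the weight scales (variance 1/width for sigmoid, 2/width for ReLU) are all assumptions chosen to keep the example small. It runs a single forward and backward pass and reports the norm of the weight gradient at each layer.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def make_weights(depth, width, gain, rng):
    """Square weight matrices with entries drawn from N(0, gain / width)."""
    std = math.sqrt(gain / width)
    return [[[rng.gauss(0.0, std) for _ in range(width)]
             for _ in range(width)]
            for _ in range(depth)]


def layer_gradient_norms(weights, x, activation):
    """Norm of dL/dW at each layer, for the toy loss L = sum of final outputs."""
    layer_inputs, local_derivs = [], []
    a = x
    for W in weights:  # forward pass, remembering what backprop needs
        layer_inputs.append(a)
        z = [sum(w * v for w, v in zip(row, a)) for row in W]
        if activation == "sigmoid":
            a = [sigmoid(v) for v in z]
            local_derivs.append([s * (1.0 - s) for s in a])
        else:  # ReLU
            a = [max(0.0, v) for v in z]
            local_derivs.append([1.0 if v > 0.0 else 0.0 for v in z])

    grad_a = [1.0] * len(a)  # dL/da at the output of the last layer
    norms = [0.0] * len(weights)
    for i in reversed(range(len(weights))):  # backward pass
        delta = [g * d for g, d in zip(grad_a, local_derivs[i])]  # dL/dz
        # dL/dW is the outer product of delta and the layer input, so its
        # Frobenius norm is the product of the two vector norms.
        norms[i] = norm(delta) * norm(layer_inputs[i])
        W = weights[i]
        grad_a = [sum(W[j][k] * delta[j] for j in range(len(delta)))
                  for k in range(len(layer_inputs[i]))]
    return norms


rng = random.Random(0)
depth, width = 10, 16
x = [rng.gauss(0.0, 1.0) for _ in range(width)]
sigmoid_norms = layer_gradient_norms(
    make_weights(depth, width, 1.0, rng), x, "sigmoid")
relu_norms = layer_gradient_norms(
    make_weights(depth, width, 2.0, rng), x, "relu")

for i, (s, r) in enumerate(zip(sigmoid_norms, relu_norms), start=1):
    print(f"layer {i:2d}  sigmoid {s:.3e}  relu {r:.3e}")
```

Reading the printed table from the last layer back to the first, the sigmoid column should fall by several orders of magnitude while the ReLU column stays within a much narrower range. A fuller version of the experiment would repeat the measurement over many training steps and inputs.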

Explainer

From backpropagation, you know that training a neural network means computing the gradient of the loss with respect to every weight, then nudging each weight in the direction that reduces the loss. The chain rule makes this possible: the gradient at any layer is the product of local gradients along the path from the output back to that layer. The vanishing gradient problem is what happens when that product shrinks to near zero, effectively cutting off learning for the earlier layers of a deep network.
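The product structure can be written out explicitly. The notation below is an assumption introduced for illustration (it does not appear in the original text): layer l computes z_l = W_l h_{l-1} + b_l and h_l = φ(z_l), for l = 1, ..., n, and L is the loss.

```latex
% One backward step through layer l:
\frac{\partial L}{\partial h_{l-1}}
  = W_l^{\top}\,\operatorname{diag}\!\big(\phi'(z_l)\big)\,
    \frac{\partial L}{\partial h_l}

% Unrolled from the output back to layer k
% (factors ordered by increasing l, left to right):
\frac{\partial L}{\partial h_k}
  = \Big(\prod_{l=k+1}^{n} W_l^{\top}\,\operatorname{diag}\!\big(\phi'(z_l)\big)\Big)
    \frac{\partial L}{\partial h_n}

% Norm bound: each layer contributes at most \|W_l\| \cdot \max_i |\phi'(z_{l,i})|
\left\lVert \frac{\partial L}{\partial h_k} \right\rVert
  \le \Big(\prod_{l=k+1}^{n} \lVert W_l \rVert \,\max_i \big|\phi'(z_{l,i})\big|\Big)
      \left\lVert \frac{\partial L}{\partial h_n} \right\rVert
```

If every factor in the bound is below 1, the gradient norm at layer k shrinks at least geometrically in the number of layers n - k between it and the output.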

To see why this happens, consider a network with sigmoid activations. The sigmoid function squashes its input to the range (0, 1), and its derivative peaks at 0.25 (at an input of 0) and drops toward zero for large positive or negative inputs. During backpropagation, the gradient passing back through each layer is multiplied by the local sigmoid derivative (and by that layer's weights, which are set aside here for simplicity). If that derivative is 0.2 at each layer, then after 10 layers the gradient has been multiplied by 0.2¹⁰ ≈ 0.0000001. The gradient reaching the first layer is about seven orders of magnitude smaller than the gradient at the last layer. Those early layers, which learn fundamental, low-level features, barely update their weights at all. The network appears to train (the last few layers adjust), but the layers closest to the input remain near their random initialization, and the network never learns the hierarchical representations that make deep learning powerful.
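The arithmetic in this paragraph is easy to check directly. The short sketch below uses only the standard library; the variable names and the sample saturated input of 6.0 are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return s * (1.0 - s)


peak = sigmoid_derivative(0.0)        # the maximum, exactly 0.25
saturated = sigmoid_derivative(6.0)   # far from zero input: roughly 0.0025

product_typical = 0.2 ** 10           # the example from the text, about 1.0e-07
product_best_case = peak ** 10        # even at the peak, about 9.5e-07

print(peak, saturated, product_typical, product_best_case)
```

Even in the best case, where every unit sits exactly at the derivative's peak, ten sigmoid layers shrink the gradient by a factor of about a million.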

The mirror problem is exploding gradients: if the per-layer factors (weights multiplied by activation derivatives) are consistently greater than 1 in magnitude, the product grows exponentially, causing weight updates so large that training becomes numerically unstable (weights oscillate wildly or overflow to infinity). Vanishing and exploding gradients are two sides of the same coin: both come from the instability inherent in multiplying many factors together.
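The symmetry between the two failure modes shows up in a toy calculation. The sketch below assumes, purely for illustration, that every layer contributes the same scalar factor; the function name `gradient_scale` and the sample factors and depths are hypothetical choices, not taken from the original.

```python
import math


def gradient_scale(factor, depth):
    """Multiply a unit gradient by the same per-layer factor, depth times."""
    g = 1.0
    for _ in range(depth):
        g *= factor
    return g


for factor in (0.5, 1.0, 1.5):
    print(factor, gradient_scale(factor, 50))

# With enough repeated multiplications, floating point itself gives out:
# the product saturates to infinity or underflows to exactly zero.
print(math.isinf(gradient_scale(1.5, 2000)))
print(gradient_scale(0.5, 2000) == 0.0)
```

Only a factor of exactly 1 leaves the gradient unchanged with depth; anything slightly below or above it drifts toward zero or infinity at an exponential rate.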

The solutions attack the problem from multiple angles. ReLU (Rectified Linear Unit) activations have a derivative of exactly 1 for positive inputs (and 0 for negative inputs), so gradients pass through active units without shrinking. Careful initialization (Xavier for tanh/sigmoid, He for ReLU) sets initial weights so that the variance of activations and gradients stays stable across layers. Gradient clipping caps the gradient norm to prevent explosions. Most fundamentally, skip connections (as in ResNets) add shortcut paths that let gradients flow directly to earlier layers, giving them a route around the multiplicative chain. These techniques, skip connections in particular, are what made training networks with hundreds of layers feasible: the breakthrough was not more data or compute, but solving the gradient flow problem that had bottlenecked deep learning for decades.
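Three of these remedies can be sketched in a few lines of plain Python. The sketch makes several assumptions: the function names are hypothetical, the initialization formulas are the commonly used normal-distribution variants (standard deviation sqrt(2 / (fan_in + fan_out)) for Xavier and sqrt(2 / fan_in) for He), and the skip connection is reduced to a scalar toy model in which a residual block y = x + f(x) has derivative 1 + f'(x).

```python
import math


def xavier_std(fan_in, fan_out):
    """Standard deviation for Xavier (Glorot) normal initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def he_std(fan_in):
    """Standard deviation for He normal initialization (suited to ReLU)."""
    return math.sqrt(2.0 / fan_in)


def clip_by_norm(grad, max_norm):
    """Rescale grad so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


def chain_gradient(local_derivs, residual=False):
    """Gradient through a scalar chain of layers.

    Plain layer:    y = f(x),     dy/dx = f'(x)
    Residual layer: y = x + f(x), dy/dx = 1 + f'(x)
    """
    product = 1.0
    for d in local_derivs:
        product *= (1.0 + d) if residual else d
    return product


print(xavier_std(256, 256), he_std(256))
print(clip_by_norm([3.0, 4.0], max_norm=1.0))   # direction kept, norm capped

plain = chain_gradient([0.2] * 10)
with_skips = chain_gradient([0.2] * 10, residual=True)
print(plain, with_skips)
```

In the toy chain, the identity term contributed by each skip connection keeps every factor at or above 1, so the product no longer collapses toward zero. Clipping changes only the length of the gradient vector, not its direction.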

Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Dividing Integers → Unit Rates → Proportions → Percent Concept → Converting Between Fractions, Decimals, and Percents → Operations with Rational Numbers → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → 
Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Sine, Cosine, and Tangent Ratios → Trigonometric Ratios Review → Radian Measure → Converting Between Degrees and Radians → The Unit Circle → Graphing Sine and Cosine → Graphing Tangent and Reciprocal Trigonometric Functions → Derivatives of Trigonometric Functions → Antiderivatives → Indefinite Integrals → Basic Integration Rules → Riemann Sums → Definite Integral Definition → Probability Density Functions and Continuous Distributions → Cumulative Distribution Functions → Continuous Random Variables → Probability Density Functions → Expected Value → Linear Regression in Machine Learning → Neural Network Fundamentals → Backpropagation Algorithm → Multilayer Perceptrons (MLPs) → Activation Functions in Neural Networks → Vanishing Gradient Problem

Longest path: 96 steps · 645 total prerequisite topics

Prerequisites (3)

Backpropagation Algorithm (hard)
Multilayer Perceptrons (MLPs) (hard)
Activation Functions in Neural Networks (hard)

Leads To (2)

Gradient Descent and Optimization (soft)
Stochastic Gradient Descent and Variants (soft)