A 15-layer network uses sigmoid activations throughout. During training, you observe that the last few layers train effectively while the first few layers barely change their weights at all. What is the most likely cause?
A. The learning rate is too low for the early layers, which need larger updates than the final layers
B. The early layers have reached a local minimum and stopped updating naturally
C. During backpropagation, sigmoid derivatives (≤0.25 each) multiply across 15 layers, reducing the gradient to near zero before it reaches the early layers
D. The early layers have more parameters and statistically require more gradient steps to update
Answer: C
The sigmoid derivative peaks at 0.25. Even in that best case, 15 layers give 0.25¹⁵ ≈ 9×10⁻¹⁰; with a more typical factor of 0.2 per layer, 0.2¹⁵ ≈ 3×10⁻¹¹, which is effectively zero. The gradient at the first layer is therefore many orders of magnitude smaller than at the last layer, so early-layer weights barely update. This is the vanishing gradient problem: it is caused by repeated multiplication of small fractions through the chain rule, not by the learning rate or local minima.
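The arithmetic in this explanation can be checked with a short Python sketch. It assumes the simplest possible picture: one sigmoid-derivative factor per layer and no weight terms, which a real network would also multiply in. The function and variable names are illustrative only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative is largest at x = 0, where it equals 0.25.
peak = sigmoid_derivative(0.0)

# Best case: every one of the 15 layers sits exactly at the peak.
best_case = peak ** 15

# The example used in the explanation: a factor of 0.2 per layer.
typical = 0.2 ** 15

print(f"peak derivative: {peak}")
print(f"0.25 ** 15 = {best_case:.2e}")
print(f"0.2 ** 15  = {typical:.2e}")
```

Even the best case leaves less than a billionth of the original gradient after 15 layers.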
Question 2 Multiple Choice
Why does replacing sigmoid activations with ReLU activations help alleviate the vanishing gradient problem?
A. ReLU has a steeper derivative than sigmoid, which amplifies gradients throughout the network
B. For positive inputs, ReLU's derivative is exactly 1, so gradients pass through that layer without being multiplied by a fraction less than 1
C. ReLU normalizes the gradient magnitude to a constant value across all layers
D. ReLU activations skip the backpropagation step for inactive (zero-output) neurons, reducing total gradient computation
Answer: B
ReLU is defined as max(0, x), so its derivative is 1 for positive inputs and 0 for negative inputs. For active neurons, the gradient passes through multiplied by 1 rather than by a small fraction. This breaks the exponential shrinkage that afflicts sigmoid networks. ReLU does introduce the 'dying ReLU' problem (a neuron whose input stays negative outputs 0, has zero gradient, and stops learning), but this is far less severe than the universal gradient starvation caused by sigmoid in deep networks.
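A minimal sketch of the comparison, assuming a single path through 15 layers and ignoring weights. The pre-activation values are hypothetical, chosen only to make every unit active. It multiplies the per-layer activation derivatives for sigmoid and for ReLU.

```python
import math

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_derivative(x):
    # Derivative of max(0, x); the value at exactly x = 0 is a convention (0 here).
    return 1.0 if x > 0 else 0.0

def chain_product(derivative, pre_activations):
    # Multiply the local activation derivatives along one path through the layers.
    product = 1.0
    for z in pre_activations:
        product *= derivative(z)
    return product

# Hypothetical pre-activations, one per layer, all positive (active units).
pre_activations = [0.5] * 15

sigmoid_chain = chain_product(sigmoid_derivative, pre_activations)
relu_chain = chain_product(relu_derivative, pre_activations)

# One inactive unit on the path zeroes the ReLU product: the dying-ReLU case.
relu_chain_with_dead_unit = chain_product(relu_derivative, [0.5] * 14 + [-0.5])

print(f"sigmoid chain: {sigmoid_chain:.2e}")
print(f"ReLU chain:    {relu_chain}")
print(f"ReLU chain with one dead unit: {relu_chain_with_dead_unit}")
```

The ReLU product stays at exactly 1 as long as every unit on the path is active, while the sigmoid product collapses regardless.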
Question 3 True / False
The vanishing gradient problem affects most layers of a deep network equally — nearly every layer trains at the same reduced rate.
T. True
F. False
Answer: False
The problem is specifically worse for the early layers, those closest to the input and farthest from the loss. Because backpropagation computes gradients by multiplying local derivatives back through each layer, the gradient that reaches layer 1 has been multiplied by many more small factors than the gradient reaching layer 14. The last few layers (closest to the loss function) receive comparatively large gradients and train effectively. The first few layers receive near-zero gradients and stagnate near their random initialization. This asymmetry is what makes the problem so damaging: the early layers, which should learn the basic features that every later layer builds on, simply don't update.
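The asymmetry can be seen in a toy model. The sketch below assumes a chain of 15 scalar sigmoid layers, a_k = sigmoid(w_k · a_(k-1)), with every weight and the input set to 1.0. These values are arbitrary choices for illustration, not taken from any real network. It runs a manual backward pass and reports the gradient magnitude at each layer's weight.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_gradients(weights, x):
    """Return |d output / d w_k| for each layer of a scalar sigmoid chain."""
    # Forward pass: a_k = sigmoid(w_k * a_{k-1}), storing each layer's input.
    inputs, activations = [], []
    a = x
    for w in weights:
        inputs.append(a)
        a = sigmoid(w * a)
        activations.append(a)

    # Backward pass: walk from the last layer to the first.
    grads = [0.0] * len(weights)
    upstream = 1.0  # d output / d (last activation)
    for k in reversed(range(len(weights))):
        local = activations[k] * (1.0 - activations[k])  # sigmoid'(z_k)
        grads[k] = abs(upstream * local * inputs[k])
        upstream *= local * weights[k]  # gradient handed to the previous layer
    return grads

# Assumed setup: 15 layers, every weight 1.0, input 1.0.
grads = layer_gradients([1.0] * 15, x=1.0)
for layer, g in enumerate(grads, start=1):
    print(f"layer {layer:2d}: {g:.3e}")
```

In this setup the printed magnitudes shrink steadily from the last layer back to the first, with the first layer's gradient many orders of magnitude below the last layer's.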
Question 4 True / False
Skip connections in residual networks (ResNets) help solve the vanishing gradient problem by providing shortcut paths that allow gradients to flow directly to earlier layers without traversing the full multiplicative chain.
T. True
F. False
Answer: True
A residual block computes F(x) + x, where the skip connection adds the input x directly to the block's output. During backpropagation, the gradient flows both through the residual function F(x), which may shrink it, and directly through the identity path, which leaves it unchanged: the block's local derivative is F′(x) + 1 rather than F′(x) alone. Even if the F(x) path has near-zero gradient, the identity shortcut still delivers a meaningful gradient signal to the early layers. This architectural change, not just a different activation function, is what made training networks with hundreds of layers feasible.
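A scalar sketch of why the identity path matters. It assumes each block is a scalar function, so that the derivative of F(x) + x is F′(x) + 1, and it assumes small nonnegative values for F′. In a real network these are Jacobian matrices and the picture is less tidy. The derivative values below are hypothetical.

```python
def plain_chain(local_derivatives):
    # Without skip connections, the gradient is the product of the F' factors.
    product = 1.0
    for d in local_derivatives:
        product *= d
    return product

def residual_chain(local_derivatives):
    # Each block computes F(x) + x, so its derivative is F'(x) + 1.
    product = 1.0
    for d in local_derivatives:
        product *= d + 1.0
    return product

# Hypothetical per-block derivatives of F, all small, for a 100-block stack.
small = [0.01] * 100
plain = plain_chain(small)
residual = residual_chain(small)

print(f"plain stack:    {plain:.2e}")
print(f"residual stack: {residual:.3f}")
```

Under these assumptions the plain product is vanishingly small, while the residual product stays at or above 1 because every factor includes the identity term.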
Question 5 Short Answer
Explain why the vanishing gradient problem specifically prevented training of deep networks rather than just slowing down training uniformly across all layers.
Model answer: Backpropagation computes gradients by multiplying derivatives along the chain from the output back to each layer. With saturating activations like sigmoid (max derivative 0.25), each layer's activation derivative scales the gradient by a factor of at most 0.25, a cut of at least 75% unless large weights compensate. After many layers, this multiplication makes the gradient reaching early layers exponentially smaller than the gradient at later layers: not slightly smaller, but orders of magnitude smaller. The last few layers train at a normal rate; the first few layers effectively receive zero gradient and remain near random initialization. The network learns nothing hierarchical in its early layers, making depth useless.
This is why 'just train longer' does not solve the problem: the early layers aren't training slowly, they're not training at all. The gradient they receive is so close to zero that even thousands of extra epochs would produce negligible weight updates. Solutions must address the root cause — preventing the multiplicative shrinkage — rather than compensating for it with more compute.
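A rough back-of-the-envelope check of the "train longer" point, assuming plain gradient descent with a fixed learning rate. The learning rate and step count are made-up values, and the gradient scale reuses the 0.2¹⁵ figure from the 15-layer example.

```python
learning_rate = 0.1   # assumed
gradient = 0.2 ** 15  # early-layer gradient scale from the 15-layer example
steps = 100_000       # assumed number of extra update steps

# Upper bound on the total weight change, reached only if every
# step pushed the weight in the same direction.
max_total_change = learning_rate * gradient * steps

print(f"per-step update:  {learning_rate * gradient:.2e}")
print(f"max total change: {max_total_change:.2e}")
```

Even in this most favorable case, the weight moves by less than one millionth after 100,000 steps.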