Questions: Vanishing Gradient Problem

5 questions to test your understanding

Question 1 Multiple Choice

A 15-layer network uses sigmoid activations throughout. During training, you observe that the last few layers train effectively while the first few layers barely change their weights at all. What is the most likely cause?

A. The learning rate is too low for the early layers, which need larger updates than the final layers
B. The early layers have reached a local minimum and stopped updating naturally
C. During backpropagation, sigmoid derivatives (≤0.25 each) multiply across 15 layers, reducing the gradient to near zero before it reaches the early layers
D. The early layers have more parameters and statistically require more gradient steps to update
Question 2 Multiple Choice

Why does replacing sigmoid activations with ReLU activations help alleviate the vanishing gradient problem?

A. ReLU has a steeper derivative than sigmoid, which amplifies gradients throughout the network
B. For positive inputs, ReLU's derivative is exactly 1, so gradients pass through that layer without being multiplied by a fraction less than 1
C. ReLU normalizes the gradient magnitude to a constant value across all layers
D. ReLU activations skip the backpropagation step for inactive (zero-output) neurons, reducing total gradient computation
Question 3 True / False

The vanishing gradient problem affects most layers of a deep network equally — nearly every layer trains at the same reduced rate.

T. True
F. False
Question 4 True / False

Skip connections in residual networks (ResNets) help solve the vanishing gradient problem by providing shortcut paths that allow gradients to flow directly to earlier layers without traversing the full multiplicative chain.

T. True
F. False
Question 5 Short Answer

Explain why the vanishing gradient problem specifically prevented training of deep networks rather than just slowing down training uniformly across all layers.

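To check your reasoning numerically, here is a minimal sketch of how the product of activation derivatives behaves along a 15-layer chain. It is an illustration, not a full backpropagation implementation: it ignores the weight matrices entirely, and the helper names (`sigmoid_grad`, `relu_grad`, `chain_factor`) and the chosen pre-activation values are assumptions made for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    # Derivative of sigmoid; its maximum is 0.25, reached at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0


def chain_factor(local_grad, pre_activations):
    # Product of the per-layer activation derivatives along one path.
    # Assumption: weights are ignored, so only the activation terms remain.
    factor = 1.0
    for z in pre_activations:
        factor *= local_grad(z)
    return factor


depth = 15

# Best case for sigmoid: every pre-activation sits at 0 (derivative 0.25).
sigmoid_factor = chain_factor(sigmoid_grad, [0.0] * depth)

# ReLU on a path where every unit is active (positive pre-activation).
relu_factor = chain_factor(relu_grad, [1.0] * depth)

print("sigmoid factor after 15 layers:", sigmoid_factor)  # roughly 9.3e-10
print("ReLU factor after 15 layers:   ", relu_factor)     # 1.0

# How much of the gradient survives k layers back from the output (sigmoid).
for k in (1, 5, 10, 15):
    print(k, "layers back:", chain_factor(sigmoid_grad, [0.0] * k))
```

Even in sigmoid's best case the factor shrinks geometrically with distance from the output, so layers near the output still receive a usable gradient while the earliest layers receive almost none. The decay is therefore uneven across depth, not a uniform slowdown.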