Questions — Vanishing Gradient Problem

Question 1 Multiple Choice

A 15-layer network uses sigmoid activations throughout. During training, you observe that the last few layers train effectively while the first few layers barely change their weights at all. What is the most likely cause?

A. The learning rate is too low for the early layers, which need larger updates than the final layers

B. The early layers have reached a local minimum and stopped updating naturally

C. During backpropagation, sigmoid derivatives (≤0.25 each) multiply across 15 layers, reducing the gradient to near zero before it reaches the early layers

D. The early layers have more parameters and statistically require more gradient steps to update

Question 2 Multiple Choice

Why does replacing sigmoid activations with ReLU activations help alleviate the vanishing gradient problem?

A. ReLU has a steeper derivative than sigmoid, which amplifies gradients throughout the network

B. For positive inputs, ReLU's derivative is exactly 1, so gradients pass through that layer without being multiplied by a fraction less than 1

C. ReLU normalizes the gradient magnitude to a constant value across all layers

D. ReLU activations skip the backpropagation step for inactive (zero-output) neurons, reducing total gradient computation

Question 3 True / False

The vanishing gradient problem affects most layers of a deep network equally — nearly every layer trains at the same reduced rate.

T. True

F. False

Question 4 True / False

Skip connections in residual networks (ResNets) help solve the vanishing gradient problem by providing shortcut paths that allow gradients to flow directly to earlier layers without traversing the full multiplicative chain.

T. True

F. False

Question 5 Short Answer

Explain why the vanishing gradient problem specifically prevented training of deep networks rather than just slowing down training uniformly across all layers.
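
After attempting the questions, a small numerical sketch may help you check your reasoning. It computes only an upper bound on the product of activation derivatives that a gradient picks up while travelling backward through a stack of layers. It assumes the sigmoid derivative is at its maximum (0.25) at every layer and ignores the weight matrices entirely, so real gradients can be smaller or larger. The function name `max_gradient_scale` and the constants are hypothetical, introduced here for illustration.

```python
def max_gradient_scale(derivative_bound: float, layers_crossed: int) -> float:
    """Upper bound on the product of activation derivatives
    after backpropagating through `layers_crossed` layers.
    Assumption: weights are ignored; only the activation term is modeled."""
    return derivative_bound ** layers_crossed


SIGMOID_MAX_DERIVATIVE = 0.25  # sigmoid'(x) peaks at 0.25, at x = 0
RELU_ACTIVE_DERIVATIVE = 1.0   # ReLU'(x) = 1 for x > 0 (0 for inactive units)

for depth in (1, 5, 10, 15):
    sigmoid_scale = max_gradient_scale(SIGMOID_MAX_DERIVATIVE, depth)
    relu_scale = max_gradient_scale(RELU_ACTIVE_DERIVATIVE, depth)
    print(f"{depth:2d} layers crossed: sigmoid <= {sigmoid_scale:.3e}, "
          f"ReLU (active units) = {relu_scale:.1f}")
```

The bound shrinks geometrically with the number of layers crossed, so layers far from the output receive a much smaller signal than layers near it.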

