In a neural network with L layers, backpropagation computes the gradient of the loss with respect to the weights in layer l. What does it propagate backward through the network to accomplish this?
AThe raw activation values from the forward pass
BThe predicted output values for each training example
CError signals (partial derivatives of the loss) from later layers
DThe learning rate scaled by the layer index
Backpropagation applies the chain rule: to get ∂L/∂W_l, you need ∂L/∂a_l (the error signal at layer l's output), which depends on error signals from all layers after l. These error signals — not raw activations — are what get propagated backward. The forward pass already computed and stored the activations; the backward pass uses them together with the propagated error signals.
Question 2 True / False
Backpropagation can primarily be applied to neural networks that use sigmoid activation functions.
TTrue
FFalse
Answer: False
Backpropagation requires only that each operation in the computation graph be differentiable (or sub-differentiable). It applies equally to ReLU, tanh, softmax, and any other differentiable activation. More broadly, backpropagation is just reverse-mode automatic differentiation and works on any differentiable computational graph — not just neural networks.
Question 3 Short Answer
What is the vanishing gradient problem, and in what part of the network does it cause the most harm?
Think about your answer, then reveal below.
Model answer: During backpropagation, gradients are multiplied together as they travel through layers. If each layer's gradient is less than 1 (as commonly happens with saturating activations like sigmoid), the product shrinks exponentially with depth. Early layers receive near-zero gradients and barely learn.
The chain rule means the gradient at layer l is a product of all the local gradients in layers l+1 through L. Sigmoid outputs are bounded in (0,1), and its derivative peaks at 0.25 — repeated multiplication quickly drives signals toward zero. This is why deep networks with sigmoid/tanh struggled before ReLU activations and batch normalization became standard.