5 questions to test your understanding
A researcher builds a 10-layer neural network using only linear transformations between layers — no activation functions. What is the effective expressive power of this network compared to a single-layer linear model?
A multi-class classifier with 5 output classes uses ReLU as its output layer activation. What is the primary problem with this design?
True or false: Stacking more linear layers without activation functions allows a neural network to model increasingly complex, nonlinear decision boundaries.
True or false: ReLU avoids the vanishing gradient problem for positive inputs because its derivative there is exactly 1, allowing gradient signals to flow backward through layers without shrinking.
Why is a nonlinear activation function necessary between layers of a neural network? What happens if all activations are removed?
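Once you have attempted the questions, you can check your reasoning about stacked linear layers numerically. The following is a minimal sketch, not a reference implementation: the helper names (`matmul`, `matvec`), the layer width of 4, and the random weights are all assumptions made for illustration, and biases are left out to keep it short. It composes 10 linear layers and compares the result with a single matrix obtained by multiplying the 10 weight matrices together.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matvec(m, v):
    """Apply matrix m to vector v."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]

random.seed(0)
dim = 4          # hypothetical layer width
num_layers = 10  # matches the 10-layer network in the first question

# Ten random weight matrices, one per layer (no biases, no activations).
layers = [
    [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
    for _ in range(num_layers)
]

# Collapse the stack into one matrix: W = W10 @ W9 @ ... @ W1.
collapsed = layers[0]
for w in layers[1:]:
    collapsed = matmul(w, collapsed)

x = [random.uniform(-1, 1) for _ in range(dim)]

# Forward pass through all ten linear layers, one at a time.
h = x
for w in layers:
    h = matvec(w, h)
deep_out = h

# Forward pass through the single collapsed layer.
single_out = matvec(collapsed, x)

# Largest elementwise gap between the two outputs (floating-point noise only).
max_diff = max(abs(a - b) for a, b in zip(deep_out, single_out))
print(max_diff)
```

The two outputs agree up to floating-point rounding, which you can compare against your answer on how much the extra layers add. Inserting a nonlinearity such as ReLU between the layers breaks this collapse, because the composition can no longer be written as one matrix product.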