You build a 10-layer neural network but replace every activation function with the identity function (f(x) = x), so every neuron computes a purely linear transformation. Compared to a single-layer linear network, this 10-layer network can represent:
A. Exponentially more complex functions because it has 10 times as many layers
B. Exactly the same class of functions (only linear mappings), because the composition of linear functions is linear
C. More complex functions because deeper networks always have greater representational power
D. Slightly more complex functions due to the increased number of parameters
Answer: B
A composition of affine functions is still affine, regardless of depth. If layer i computes W_i x + b_i, then two stacked layers give W_2 (W_1 x + b_1) + b_2 = (W_2 W_1) x + (W_2 b_1 + b_2), which is a single affine transformation. Repeating the argument collapses all n layers into one map W x + b with W = W_n ... W_1. No matter how many layers you stack, a network with only identity activations cannot represent XOR or any other function that is not affine. This is why nonlinear activation functions are not optional: they are what gives deep networks their representational power.
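The collapse described above can be checked numerically. The following is a minimal sketch in plain Python, assuming a hypothetical 10-layer network of width 3 with random weights; the helper names (`forward`, `collapse`) are illustrative and not taken from any library.

```python
import random

def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def vec_add(u, v):
    return [a + b for a, b in zip(u, v)]

def forward(layers, x):
    """Apply each affine layer in turn, with identity activation."""
    for W, b in layers:
        x = vec_add(matvec(W, x), b)
    return x

def collapse(layers):
    """Fold a stack of affine layers into one equivalent (W, b) pair."""
    W_eff, b_eff = layers[0]
    for W, b in layers[1:]:
        # W (W_eff x + b_eff) + b = (W W_eff) x + (W b_eff + b)
        b_eff = vec_add(matvec(W, b_eff), b)
        W_eff = matmul(W, W_eff)
    return W_eff, b_eff

random.seed(0)
dim = 3  # hypothetical layer width, chosen only for illustration
layers = [
    ([[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)],
     [random.uniform(-0.5, 0.5) for _ in range(dim)])
    for _ in range(10)
]

x = [1.0, -2.0, 0.5]
deep_out = forward(layers, x)
W_eff, b_eff = collapse(layers)
shallow_out = vec_add(matvec(W_eff, x), b_eff)

# The two outputs should agree up to floating-point rounding.
max_gap = max(abs(a - b) for a, b in zip(deep_out, shallow_out))
print(max_gap)
```

The single pair `(W_eff, b_eff)` reproduces the 10-layer output for any input, which is the sense in which the deep identity-activation network and the single-layer network represent the same class of functions.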
Question 2 Multiple Choice
A student reads the universal approximation theorem and concludes: 'Since a single hidden layer MLP can approximate any continuous function, there is never a practical reason to use deep networks.' What is the critical flaw in this reasoning?
A. The theorem only applies to regression problems, not classification
B. The theorem requires the activation function to be linear, which contradicts using hidden layers
C. The theorem guarantees approximation exists but does not bound the number of neurons required; a shallow network may need exponentially more neurons than a deep one for the same accuracy
D. Deep networks are only better when training data is large, so the theorem applies equally to small datasets
Answer: C
The universal approximation theorem is an existence result: it guarantees that a shallow MLP with a suitable nonlinear activation *can* approximate any continuous function on a bounded domain to any desired accuracy given enough neurons, but 'enough' can be astronomically many. A deep network can represent the same function hierarchically, reusing learned features across layers, and often requires far fewer total parameters. The LEGO analogy captures this: you *could* build a 3D shape from a single flat layer of tiny bricks, but stacking layers is vastly more efficient. The theorem tells you what's possible, not what's practical.
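One hand-constructed case where depth pays off is a sawtooth built by composing a "tent" layer with itself. The sketch below assumes a scalar input in [0, 1] and ReLU activations. The function names are illustrative, and the example is a deliberately favourable construction, not a claim about functions learned in practice.

```python
def relu(z):
    return max(0.0, z)

def tent(x):
    """One hidden layer with two ReLU units: rises 0 -> 1 on [0, 0.5], falls 1 -> 0 on [0.5, 1]."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_sawtooth(x, depth):
    """Compose the tent layer `depth` times (2 * depth ReLU units in total)."""
    for _ in range(depth):
        x = tent(x)
    return x

def count_linear_pieces(f, n_grid=1024):
    """Count linear pieces of f on [0, 1] by counting slope changes on a dyadic grid."""
    xs = [i / n_grid for i in range(n_grid + 1)]
    slopes = [(f(b) - f(a)) * n_grid for a, b in zip(xs, xs[1:])]
    return 1 + sum(1 for s, t in zip(slopes, slopes[1:]) if s != t)

depth = 4
pieces = count_linear_pieces(lambda x: deep_sawtooth(x, depth))
deep_units = 2 * depth

# A one-hidden-layer ReLU network on a scalar input with n units has at most
# n kinks, hence at most n + 1 linear pieces, so it needs at least pieces - 1 units.
shallow_units_needed = pieces - 1

print(pieces, deep_units, shallow_units_needed)
```

Each extra tent layer doubles the number of linear pieces while adding only two units, so the deep construction grows linearly in size while the matching shallow network must grow exponentially.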
Question 3 True / False
A neural network without any nonlinear activation functions in its hidden layers has the same representational power as a single linear layer, regardless of how many hidden layers it has.
T. True
F. False
Answer: True
The composition of any number of affine transformations (a matrix multiply plus a bias) is itself an affine transformation. Without nonlinearity, stacking layers adds parameters but no expressive power: the network is functionally equivalent to a single matrix multiply plus bias. This is why activation functions like ReLU, sigmoid, or tanh are essential: they are what allow the network to learn non-linear decision boundaries and complex feature hierarchies.
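To make "adds parameters but no expressive power" concrete, the following sketch counts weights and biases for a stack of fully connected layers. The layer sizes (8 inputs, nine hidden layers of width 64, 1 output) are hypothetical and chosen only for illustration.

```python
def count_params(layer_sizes):
    """Weights plus biases for fully connected layers of the given sizes."""
    return sum((fan_in + 1) * fan_out
               for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical 10-layer network with identity activations.
deep_sizes = [8] + [64] * 9 + [1]
deep_params = count_params(deep_sizes)

# The single affine layer it collapses to: 8 weights plus 1 bias.
single_params = count_params([8, 1])

print(deep_params, single_params)
```

Under these assumed sizes the deep stack carries 33921 parameters, yet every function it can compute is already expressible with the 9 parameters of the single layer.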
Question 4 True / False
According to the universal approximation theorem, in practice a single hidden-layer MLP is generally as efficient (in terms of total parameters) as a deep network for approximating complex functions.
T. True
F. False
Answer: False
The theorem guarantees that a sufficiently wide single hidden layer *can* approximate any continuous function, but it says nothing about efficiency. For some families of functions, a shallow network provably needs exponentially more neurons than a comparable deep network. Deep networks learn hierarchical features (early layers detect simple patterns, later layers combine them into complex abstractions), which lets them reuse representations efficiently. In practice, on many real-world tasks, deep networks reach better performance with fewer total parameters than shallow wide networks.
Question 5 Short Answer
Why is a nonlinear activation function essential in hidden layers of an MLP, and what would be lost without it?
Model answer: Without nonlinear activation functions, every hidden layer computes an affine transformation (matrix multiply plus bias), and the composition of affine transformations is itself affine. No matter how many layers are stacked, the network can only represent linear input-output relationships — it cannot solve XOR, classify non-linearly separable data, or approximate curved functions. The activation function (ReLU, sigmoid, tanh) introduces the bending and folding of input space that allows each layer to carve out increasingly complex decision regions. Nonlinearity is what transforms a stack of linear operations into a universal function approximator.
This is the central conceptual point of the MLP: depth without nonlinearity is useless. The activation function after each layer is what gives each layer the power to transform the representation in a non-trivial way, so that subsequent layers operate on a different 'view' of the data rather than just a linearly rescaled version of the original input.
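The XOR case mentioned in the model answer can be illustrated with a tiny hand-wired network. This is a minimal sketch with two hidden units and weights chosen by hand for illustration (they are not learned, and the function names are hypothetical). The same weights are evaluated once with ReLU and once with the identity activation.

```python
def relu(z):
    return max(0.0, z)

def identity(z):
    return z

def tiny_net(x1, x2, act):
    """Two hidden units followed by a linear output; `act` is the hidden activation."""
    h1 = act(x1 + x2)
    h2 = act(x1 + x2 - 1.0)
    return h1 - 2.0 * h2

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_targets = [0.0, 1.0, 1.0, 0.0]

relu_out = [tiny_net(a, b, relu) for a, b in inputs]
linear_out = [tiny_net(a, b, identity) for a, b in inputs]

print(relu_out)    # matches xor_targets
print(linear_out)  # collapses to 2 - (x1 + x2), which is not XOR
```

With ReLU the network reproduces XOR exactly. With the identity activation the same weights reduce to the affine map 2 - (x1 + x2). No other choice of weights can help, because every affine f satisfies f(0,0) + f(1,1) = f(0,1) + f(1,0), whereas XOR requires 0 on the left and 2 on the right.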