A CNN is trained on images of stop signs appearing in the center of frames. A new image has a stop sign in the upper-left corner. What does the CNN's convolutional architecture predict?
A. The network fails because it learned weights specific to the center position and must be retrained
B. The activation for the stop sign feature shifts to the upper-left of the feature map, because CNNs are translation equivariant
C. Pooling layers correct for position, producing the same output regardless of where the sign appears
D. The network detects the sign only if it was also trained on images with upper-left stop signs
Answer: B
Translation equivariance is the core property of convolutional layers: the same filter slides across every position, so detecting a feature in a new location simply means the corresponding activation appears at the new position in the feature map. This is fundamentally different from a fully connected network, where each input position has unique weights — a pattern at a new position truly would require retraining. The equivariance property is not an accident; it is a direct consequence of weight sharing. Note that pooling adds approximate translation invariance (the final output may not change at all for small shifts), but equivariance in the feature maps comes from convolution itself.
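The shift described above can be checked numerically. The following is a minimal sketch, not a real CNN: it assumes a single hand-written 3×3 filter, stride 1, no padding, and a toy 12×12 image. The names `conv2d_valid`, `place`, and the cross-shaped pattern are hypothetical and exist only for this illustration.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Stride-1 'valid' cross-correlation: the same kernel is applied at every position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hypothetical "feature": a small cross. The filter is a copy of it,
# so the response peaks where the cross sits in the image.
pattern = np.array([[0.0, 1.0, 0.0],
                    [1.0, 1.0, 1.0],
                    [0.0, 1.0, 0.0]])
kernel = pattern.copy()

def place(top, left, size=12):
    """Return a blank image with the pattern's top-left corner at (top, left)."""
    img = np.zeros((size, size))
    img[top:top + 3, left:left + 3] = pattern
    return img

out_center = conv2d_valid(place(5, 5), kernel)   # pattern near the center
out_corner = conv2d_valid(place(2, 2), kernel)   # pattern moved up and left by 3

peak_center = tuple(int(v) for v in np.unravel_index(np.argmax(out_center), out_center.shape))
peak_corner = tuple(int(v) for v in np.unravel_index(np.argmax(out_corner), out_corner.shape))

# Shifting the centered feature map by (-3, -3) reproduces the corner feature map.
maps_match = np.array_equal(np.roll(out_center, (-3, -3), axis=(0, 1)), out_corner)

print(peak_center, peak_corner, maps_match)
```

The filter weights never change between the two runs. Only the location of the peak moves, by the same offset as the pattern.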
Question 2 Multiple Choice
Why does weight sharing in a convolutional layer dramatically reduce the number of parameters compared to a fully connected layer processing the same input?
A. Convolutional layers use simpler activation functions that require fewer computations
B. The same small filter (e.g., 3×3 weights) is applied at every spatial position, so filter parameters are not duplicated per position
C. Pooling layers remove most neurons before any learned weights are applied
D. CNNs process each color channel independently, reducing the effective input size
In a fully connected layer, every input pixel has its own weight connecting it to every neuron. For a 256×256 grayscale image with 1,000 hidden neurons, that is ~65 million weights (~197 million for a three-channel RGB image). In a convolutional layer with a 3×3 filter over a single input channel, those 9 weights (plus a bias) are shared across all spatial positions. If the input is 256×256, the same 9 weights are applied at each of ~65,000 positions. Parameter count drops from tens of millions to about ten per filter. The network learns fewer numbers but applies them everywhere, which also encodes the assumption that the same local feature detector is useful throughout the image.
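The arithmetic behind these counts is short enough to write out. This sketch assumes one 3×3 filter per input channel slice, "same" padding so that every pixel is a filter position, and fully connected biases left out of the weight count. All variable names are illustrative.

```python
height, width = 256, 256
hidden_units = 1000

# Fully connected: one weight per (input value, hidden unit) pair.
fc_weights_gray = height * width * 1 * hidden_units   # single-channel input
fc_weights_rgb = height * width * 3 * hidden_units    # three-channel input

# Convolution: one 3x3 filter over a single channel, plus one bias,
# reused at every spatial position.
conv_params_per_filter = 3 * 3 + 1
positions = height * width  # assumes 'same' padding and stride 1

print(f"fully connected (grayscale): {fc_weights_gray:,}")
print(f"fully connected (RGB):       {fc_weights_rgb:,}")
print(f"conv filter parameters:      {conv_params_per_filter}")
print(f"positions sharing them:      {positions:,}")
```

The fully connected figure depends on the channel count: roughly 65.5 million weights for a grayscale input and roughly 197 million for RGB. The filter's parameter count does not depend on image size at all.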
Question 3 True / False
A convolutional layer is translation equivariant: moving a feature in the input produces a corresponding shift in the feature map output.
T. True
F. False
Answer: True
Translation equivariance is a defining property of convolution. Because the filter slides across the entire input with the same weights, detecting a pattern at position (x, y) produces an activation at the corresponding location in the output feature map. If the pattern moves to (x+5, y+3), the activation shifts by the same amount (assuming stride 1 and ignoring border effects). This is not translation invariance (same output regardless of position): the output changes, but in a perfectly predictable, consistent way. Equivariance is what allows early layers to detect features and later layers to combine them regardless of absolute position.
Question 4 True / False
Max pooling layers are what give CNNs their translation equivariance property.
T. True
F. False
Answer: False
Translation equivariance comes from the convolutional layers, not from pooling. Pooling provides a related but different property: approximate translation invariance — small shifts in the input may produce the same pooled output, because the maximum value within a region is unaffected by small displacements. Equivariance (output shifts with input) is a property of convolution. Invariance (output stays the same) is what pooling adds. Conflating the two is a common error. Many tasks benefit from equivariance in intermediate representations (to locate features) and invariance at the final output (to classify regardless of exact position).
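The "approximate" in approximate invariance can be seen with a toy feature map. This sketch assumes non-overlapping 2×2 max pooling with stride 2 on an 8×8 map containing a single activation. The helper `max_pool_2x2` is written here for illustration and is not taken from any library.

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling (stride 2) on an even-sized 2D array."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def single_activation(row, col, size=8):
    """Feature map that is zero everywhere except one activation at (row, col)."""
    fmap = np.zeros((size, size))
    fmap[row, col] = 1.0
    return fmap

base = max_pool_2x2(single_activation(2, 2))
small_shift = max_pool_2x2(single_activation(3, 3))  # stays inside the same 2x2 window
large_shift = max_pool_2x2(single_activation(4, 4))  # crosses into the next window

same_after_small_shift = np.array_equal(base, small_shift)
same_after_large_shift = np.array_equal(base, large_shift)

print(same_after_small_shift, same_after_large_shift)
```

A one-pixel shift that stays inside the pooling window leaves the pooled output unchanged, while a shift that crosses a window boundary moves the pooled activation. That is why pooling gives invariance only to small displacements.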
Question 5 Short Answer
What inductive bias does a CNN encode, and why does this make it more appropriate than a fully connected network for image classification?
Model answer: A CNN encodes the inductive bias that useful visual features are local (detectable from small patches) and position-independent (the same feature detector should work everywhere in the image). This is captured by small filters (locality) and weight sharing (position independence). A fully connected network has no such bias — it treats every pixel as equally related to every other, must independently learn that the same edge detector is useful at the top-left and bottom-right, and requires far more data and parameters to match a CNN's performance on images.
An inductive bias is a prior assumption about the structure of the problem baked into the architecture. CNNs encode two: local connectivity (nearby pixels are more related than distant ones) and translation equivariance (the same patterns matter regardless of position). These assumptions hold well for most natural images, which is why CNNs are so effective at image tasks even with limited training data. Fully connected networks can in principle represent the same functions, but they typically need far more data to discover these spatial regularities from scratch.