5 questions to test your understanding
A CNN trained on faces correctly classifies a normal face image. When shown an image with two eyes positioned below the mouth (spatially scrambled), the CNN still outputs high confidence for 'face.' What architectural feature of CNNs explains this failure?
In a capsule network, a 'mouth capsule' outputs a vector with length 0.95 and a specific orientation. What do the vector's length and its orientation each represent?
True or false: In routing by agreement, connections between lower-level capsules (parts) and higher-level capsules (wholes) are strengthened when the part capsules' predictions about the parent capsule are geometrically consistent with each other.
True or false: Capsule networks are more computationally efficient than CNNs because routing by agreement eliminates the need for multiple convolutional layers.
Why does a capsule's vector output provide viewpoint equivariance in a more structural way than a CNN's pooling-based approach?
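
For readers who want to experiment with the routing-by-agreement item above, here is a minimal NumPy sketch. It assumes the commonly described dynamic-routing formulation (softmax coupling coefficients, a "squash" nonlinearity, and dot-product agreement); the names `squash`, `route_by_agreement`, and `u_hat`, as well as the toy predictions, are illustrative choices and not taken from this quiz.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Rescale vectors so their length lies in [0, 1), keeping direction."""
    sq_norm = np.sum(s * s, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def route_by_agreement(u_hat, num_iters=3):
    """Toy routing loop (illustrative, not a reference implementation).

    u_hat: array of shape (num_parts, num_wholes, dim) holding each part
           capsule's predicted pose vector for each candidate parent.
    Returns the coupling coefficients c and the parent outputs v.
    """
    num_parts, num_wholes, _ = u_hat.shape
    b = np.zeros((num_parts, num_wholes))  # routing logits, start uniform
    for _ in range(num_iters):
        # Softmax over parents: each part distributes its vote across wholes.
        c = np.exp(b - b.max(axis=1, keepdims=True))
        c /= c.sum(axis=1, keepdims=True)
        # Each parent's input is the coupling-weighted sum of predictions.
        s = np.einsum("ij,ijd->jd", c, u_hat)
        v = squash(s)
        # Raise the logit where a part's prediction aligns with the parent output.
        b = b + np.einsum("ijd,jd->ij", u_hat, v)
    return c, v

# Assumed toy data: two parts, two candidate parents, 2-D pose vectors.
# Both parts predict the same pose for parent 0, opposite poses for parent 1.
u_hat = np.array([
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [-1.0, 0.0]],
])
c, v = route_by_agreement(u_hat)
print("coupling coefficients:\n", c)
print("parent vector lengths:", np.linalg.norm(v, axis=1))
```

In this toy setup the two parts make identical predictions for the first parent and contradictory ones for the second, so printing `c` after a few iterations lets you compare how much each part routes to each parent.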