A researcher tries to repurpose a standard CNN image classifier for semantic segmentation by attaching a softmax layer that outputs class probabilities independently for each pixel. What is the fundamental problem with this approach?
A. CNNs cannot process images with more than three channels, making pixel-level output impossible
B. Progressive pooling and striding reduce spatial resolution so severely that per-pixel localization is lost by the final layers
C. Softmax normalization across all pixels forces the model to assign each class to exactly one region
D. The classification loss function is incompatible with pixel-level supervision
Answer: B
Standard classification CNNs use repeated pooling and strided convolutions that sharply reduce spatial resolution: a 224×224 input might become a 7×7 feature map. That compactness is fine for producing a single image-level label, but segmentation needs a label for every pixel at full resolution. A per-pixel softmax attached to the final layers can only predict on the coarse grid, because the fine spatial detail needed to localize individual pixels was discarded during downsampling. Fully convolutional networks and encoder-decoder architectures address this by limiting the downsampling or by reversing it with upsampling.
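The resolution arithmetic above can be checked with a short sketch. It assumes five downsampling stages that each halve the spatial size with a 2×2, stride-2 pooling window; that stage count is an assumption chosen to reproduce the 224 to 7 example, not something the question specifies. `conv_output_size` and `trace_downsampling` are illustrative helpers, not library functions.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Spatial size after a convolution or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_downsampling(size, num_stages, kernel=2, stride=2):
    """Return the spatial size before and after each pooling stage."""
    sizes = [size]
    for _ in range(num_stages):
        size = conv_output_size(size, kernel, stride)
        sizes.append(size)
    return sizes


# Assumed: five stride-2 stages, as in many classification backbones.
sizes = trace_downsampling(224, num_stages=5)
print(sizes)  # [224, 112, 56, 28, 14, 7]

# With non-overlapping 2x2 pooling, each cell of the final map
# summarizes a 32x32 block of input pixels.
print(sizes[0] // sizes[-1])  # 32
```

A softmax applied to the 7×7 map therefore yields 49 predictions, one per 32×32 block, rather than one per pixel.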
Question 2 Multiple Choice
A semantic segmentation model produces accurate class predictions but jagged, imprecise boundaries around objects. Which architectural modification would most directly address this?
A. Adding more pooling layers to increase the semantic richness of features
B. Replacing dilated convolutions with standard convolutions to reduce receptive field size
C. Adding skip connections that forward high-resolution feature maps from early encoder layers to the decoder
D. Increasing the number of output classes to capture finer boundary categories
Answer: C
Jagged or blurry boundaries arise when the decoder must reconstruct spatial detail from a coarse, semantically rich representation alone. Early encoder layers hold fine-grained spatial information (edges, textures) at full or near-full resolution, but repeated downsampling discards it in deeper layers. Skip connections forward these high-resolution feature maps directly to the corresponding decoder levels, so the decoder can combine semantic context from deep layers with spatial precision from early layers. This is the design U-Net uses.
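The boundary effect described above can be illustrated with a 1D toy example. This is a minimal sketch, not a real network: average pooling stands in for the encoder, nearest-neighbor upsampling for the decoder, and pairing values by position stands in for the channel concatenation a real decoder would follow with learned convolutions. All function names are hypothetical.

```python
def avg_pool(signal, factor):
    """Downsample a 1D signal by averaging non-overlapping windows."""
    return [sum(signal[i:i + factor]) / factor
            for i in range(0, len(signal), factor)]


def upsample_nearest(signal, factor):
    """Upsample a 1D signal by repeating each value `factor` times."""
    return [v for v in signal for _ in range(factor)]


def first_index_above(signal, threshold=0.5):
    """Index of the first value above threshold (the predicted boundary)."""
    return next(i for i, v in enumerate(signal) if v > threshold)


# A step edge: background (0) for three pixels, then object (1).
fine = [0, 0, 0, 1, 1, 1, 1, 1]
coarse = avg_pool(fine, 4)             # [0.25, 1.0]
decoded = upsample_nearest(coarse, 4)  # boundary snaps to the pooling grid

# Toy "skip connection": pair each upsampled value with the fine feature
# at the same position, as channel concatenation would in a decoder.
with_skip = list(zip(decoded, fine))

print(first_index_above(fine))     # 3, the true boundary
print(first_index_above(decoded))  # 4, shifted onto the pooling grid
print(first_index_above([f for _, f in with_skip]))  # 3, still in the skip channel
```

Without the skip channel, the decoder only sees values aligned to the pooling grid, so the boundary lands one pixel off. With it, the exact edge position is still available to be used.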
Question 3 True / False
Dilated (atrous) convolutions expand the receptive field by adding more learnable parameters to the convolutional kernel.
T. True
F. False
Answer: False
Dilated convolutions expand the receptive field by spacing out the sampling locations of an existing kernel: a 3×3 kernel with dilation rate 2 covers a 5×5 area but still uses only 9 weights per input-output channel pair. No parameters are added. This is the key advantage: the large receptive fields needed to capture context for correct pixel classification come without the parameter cost of larger standard kernels or the resolution loss of additional pooling layers.
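The 3×3 to 5×5 claim follows from the span formula k + (k - 1)(d - 1) for kernel size k and dilation d. The sketch below checks it along one axis with a plain-Python 1D dilated convolution. The helper names are illustrative, and the all-ones weights are an arbitrary choice made so the sums are easy to verify.

```python
def effective_kernel_size(kernel, dilation):
    """Span covered by a dilated kernel along one axis."""
    return kernel + (kernel - 1) * (dilation - 1)


def dilated_conv1d(signal, weights, dilation=1):
    """'Valid' 1D cross-correlation with taps spaced `dilation` apart."""
    span = effective_kernel_size(len(weights), dilation)
    return [
        sum(w * signal[start + j * dilation] for j, w in enumerate(weights))
        for start in range(len(signal) - span + 1)
    ]


weights = [1, 1, 1]  # three taps, regardless of dilation
signal = list(range(10))

print(effective_kernel_size(3, 1))  # 3
print(effective_kernel_size(3, 2))  # 5
print(dilated_conv1d(signal, weights, dilation=1))  # [3, 6, 9, 12, 15, 18, 21, 24]
print(dilated_conv1d(signal, weights, dilation=2))  # [6, 9, 12, 15, 18, 21]
```

In both calls the kernel has three weights. Only the spacing between the sampled positions changes, which is why the span grows from 3 to 5 while the parameter count stays fixed.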
Question 4 True / False
Skip connections in encoder-decoder segmentation models (such as U-Net) allow the decoder to recover fine spatial details that are progressively lost during encoding.
T. True
F. False
Answer: True
This is exactly the role skip connections play. During encoding, downsampling increases semantic richness but destroys spatial precision. Skip connections bypass this bottleneck by routing high-resolution feature maps from early encoder layers directly to corresponding decoder layers. The decoder can then combine broad semantic understanding (from the bottleneck) with sharp spatial detail (from the skip connections), producing accurate segmentation with well-defined boundaries.
Question 5 Short Answer
Explain the fundamental tension in semantic segmentation between spatial resolution and semantic richness, and describe how encoder-decoder architectures resolve it.
Model answer: Deep CNNs build semantic richness through downsampling: pooling and striding compress the spatial map so that each unit in a deep feature map has a large receptive field and encodes abstract, category-level information. But segmentation requires a full-resolution output map in which every pixel has a label, so the spatial detail discarded during encoding must be recovered. Encoder-decoder architectures resolve this by pairing a standard encoding (downsampling) path with a decoding (upsampling) path that restores resolution. Skip connections bridge the two paths, forwarding high-resolution spatial features from early encoder layers to the decoder so that boundary precision and semantic accuracy are achieved together.
The core insight is that classification and localization require opposing properties from a network: classification benefits from large receptive fields and abstract representations (achieved by downsampling), while localization requires precise spatial detail (destroyed by downsampling). Encoder-decoder architectures with skip connections represent the canonical solution to this tension in dense prediction tasks.
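The encoder-decoder pairing described above can be made concrete by tracing feature-map shapes. The sketch below assumes a hypothetical U-Net-like layout in which resolution halves and channel count doubles at each encoder level, and each decoder level concatenates the skip features of matching resolution. The input size, base channel count, and depth are illustrative values, not taken from the text.

```python
def unet_shape_trace(input_size, base_channels, depth):
    """Trace (stage, spatial size, channels) through a U-Net-like network."""
    trace = []
    skips = []
    size, channels = input_size, base_channels

    # Encoder: store each level's shape for the skip connection.
    for level in range(depth):
        trace.append((f"enc{level}", size, channels))
        skips.append((size, channels))
        size //= 2       # 2x2 pooling halves resolution
        channels *= 2    # channel count doubles per level (assumed)

    trace.append(("bottleneck", size, channels))

    # Decoder: upsample, then concatenate the matching skip features.
    for level in reversed(range(depth)):
        size *= 2        # upsampling doubles resolution
        channels //= 2   # upsampling step halves channels (assumed)
        skip_size, skip_channels = skips[level]
        assert skip_size == size  # skip and upsampled maps must align
        trace.append((f"dec{level}", size, channels + skip_channels))

    return trace


trace = unet_shape_trace(input_size=256, base_channels=64, depth=4)
for stage in trace:
    print(stage)
```

The final decoder stage is back at the input resolution (256), and each decoder stage sees twice the upsampled channel count because the skip features are stacked alongside it.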