Two object detection systems are benchmarked: System A runs at 4 FPS with 87% mean average precision (mAP); System B runs at 50 FPS with 76% mAP. Which architectural family most likely corresponds to each?
C. A: R-CNN with selective search; B: SSD single-shot detector
D. A: sliding-window CNN classifier; B: Faster R-CNN with Feature Pyramid Network
Two-stage detectors (Faster R-CNN family) propose regions and then classify them, achieving higher accuracy at the cost of speed. Single-shot detectors (YOLO, SSD family) predict boxes directly in one forward pass, enabling real-time speeds at a modest accuracy penalty. The 4 FPS / high accuracy profile is characteristic of two-stage methods; 50 FPS / slightly lower accuracy is characteristic of single-shot methods.
Question 2 Multiple Choice
A detector produces 18 overlapping bounding boxes around the same cat in an image, all with varying confidence scores. What technique selects the single best prediction and discards the rest?
A. Feature Pyramid Network (FPN), which merges multi-scale features into one prediction
B. Region Proposal Network (RPN), which filters out redundant proposals before classification
C. Non-maximum suppression (NMS), which keeps the highest-confidence box and removes overlapping duplicates
D. Anchor box matching, which assigns each object to exactly one grid cell
Non-maximum suppression (NMS) is the post-processing step that resolves duplicate detections. It sorts candidate boxes by confidence, keeps the highest-scoring box, and suppresses every other box whose IoU (intersection over union) with the kept box exceeds a threshold, then repeats on the remaining boxes until none are left. FPN addresses multi-scale detection; RPN generates proposals but does not by itself resolve duplicate final detections; anchor matching assigns ground-truth objects to anchors during training and does not suppress predictions.
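The greedy procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not any particular library's implementation: the corner-format boxes `(x1, y1, x2, y2)`, the function names `iou` and `nms`, and the 0.5 IoU threshold are all assumptions chosen for the example.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float],
        iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: return indices of kept boxes, highest score first."""
    # Sort candidate indices by descending confidence.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    while order:
        best = order.pop(0)          # highest-scoring remaining box
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# Two near-duplicate boxes on one object plus one separate box.
boxes = [(10, 10, 110, 110), (12, 14, 112, 108), (200, 200, 260, 260)]
scores = [0.9, 0.75, 0.6]
kept = nms(boxes, scores)  # the second box is suppressed by the first
```

In a multi-class detector this is typically applied per class, so a cat box does not suppress an overlapping dog box.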
Question 3 True / False
In Faster R-CNN, the convolutional backbone processes the image only once, and the resulting feature map is shared between the Region Proposal Network and the classification head.
T. True
F. False
Answer: True
Faster R-CNN runs the backbone once per image and lets both the RPN and the classification head operate on the same feature map. The original R-CNN ran a separate CNN forward pass on each candidate region (thousands per image); Fast R-CNN removed that cost by sharing one feature map across all regions, but still relied on an external proposal method such as selective search. Faster R-CNN's key innovation is extending that sharing to proposal generation itself through the RPN. This dramatically reduces computation and is what makes two-stage detection tractable at near-real-time speeds.
Question 4 True / False
Object detection is fundamentally equivalent to running an image classifier on a sliding window at nearly every possible location and scale, making it a straightforward extension of image classification.
T. True
F. False
Answer: False
Sliding-window classification is the brute-force baseline that deep detection networks were designed to replace. Modern detectors (R-CNN family, YOLO, SSD) do not exhaustively scan all positions and scales — they learn to directly predict bounding box coordinates and class scores in ways that are far more computationally efficient. Single-shot methods like YOLO treat detection as a regression problem with a fixed-size output tensor, which is fundamentally different from applying a classifier thousands of times.
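A rough count makes the contrast concrete. The sketch below compares the number of classifier evaluations a sliding-window scan would need with the size of a fixed YOLO-style output tensor. The image size, window size, stride, and scales are illustrative assumptions; the grid defaults (S=7, B=2, C=20) follow the commonly cited original YOLO configuration, and both function names are hypothetical.

```python
def sliding_window_count(img_w: int, img_h: int, win: int,
                         stride: int, scales) -> int:
    """Number of square windows a brute-force scan would classify."""
    total = 0
    for s in scales:
        side = int(win * s)
        if side > img_w or side > img_h:
            continue  # window larger than the image: nothing to scan
        nx = (img_w - side) // stride + 1
        ny = (img_h - side) // stride + 1
        total += nx * ny
    return total


def yolo_style_output_size(S: int = 7, B: int = 2, C: int = 20) -> int:
    """Values in an S x S x (B*5 + C) output tensor.

    Each grid cell predicts B boxes (x, y, w, h, confidence)
    plus C class scores, all from a single forward pass.
    """
    return S * S * (B * 5 + C)


# Assumed setup: 448x448 image, 64-pixel base window, stride 8, three scales.
windows = sliding_window_count(448, 448, win=64, stride=8, scales=(1, 2, 4))
outputs = yolo_style_output_size()
print(f"{windows} classifier passes vs. one pass producing {outputs} values")
```

Even this coarse scan, with only three scales and a single aspect ratio, needs thousands of separate classifier evaluations, while the regression formulation produces every prediction in one pass.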
Question 5 Short Answer
Explain why Feature Pyramid Networks (FPN) are used in object detection, and what problem they solve that a single feature map from the last convolutional layer cannot handle.
Model answer: A single feature map from the last layer has low spatial resolution and high-level semantics — it can recognize large objects but misses small ones because the spatial detail has been pooled away. FPN builds a multi-scale feature hierarchy by combining high-resolution, low-level features (which retain spatial detail for detecting small objects) with low-resolution, high-level features (which have rich semantic information for detecting large objects). Predictions are made at multiple scales simultaneously, allowing the detector to handle objects of vastly different sizes in the same image.
The scale variation problem is one of the central challenges in detection: a person far away appears tiny while one up close fills the frame. Without FPN, a detector optimized for large objects misses small ones and vice versa. FPN solves this by making predictions at multiple feature pyramid levels, each tuned to a different scale range.
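One way to picture "each level tuned to a different scale range" is the heuristic used when FPN is paired with a two-stage detector: a region of width w and height h is routed to pyramid level k = floor(k0 + log2(sqrt(w*h) / 224)). The sketch below assumes the usual constants (k0 = 4, canonical size 224, levels 2 through 5); the function name `fpn_level` is hypothetical, and other detectors assign objects to levels differently.

```python
import math


def fpn_level(w: float, h: float, k0: int = 4, canonical: float = 224.0,
              k_min: int = 2, k_max: int = 5) -> int:
    """Pick the pyramid level for a region of size w x h (in pixels).

    Smaller regions map to lower levels (higher-resolution feature maps);
    larger regions map to higher levels (coarser, more semantic maps).
    """
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return max(k_min, min(k_max, k))  # clamp to the levels that exist


# A distant person, a mid-sized object, and a frame-filling object.
for size in (32, 112, 224, 448):
    print(size, "->", "level", fpn_level(size, size))
```

Halving an object's side length moves it down one level, so each level ends up responsible for roughly one octave of object sizes.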