A researcher has 400 labeled chest X-rays and wants to classify pneumonia. She loads a CNN pretrained on ImageNet and plans to retrain the model. Which strategy is most likely to achieve the best performance?
A. Retrain all layers, using the pretrained weights as starting values, with a high learning rate
B. Freeze all layers except the final classification head, since the data are too scarce to safely update any features
C. Freeze the early layers (generic feature detectors) and fine-tune the later layers plus a new classification head with a low learning rate
D. Discard the pretrained weights and train from random initialization to avoid domain mismatch
Answer: C
With limited target data, the usual strategy is to keep the generic early-layer features frozen (edges, textures, and gradients transfer well and apply to X-rays too) and fine-tune the later layers, which encode higher-level, task-specific representations. A low learning rate keeps the updates from overwriting useful pretrained weights. Option A risks destroying useful features, because a high learning rate can quickly overwrite them. Option B is overly conservative: some fine-tuning of later layers is almost always beneficial. Option D throws away the entire transfer learning advantage.
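The strategy in option C can be sketched in PyTorch as follows. This is a minimal illustration, not a recommended training recipe: `TinyCNN`, its `early`/`late`/`head` split, the layer sizes, the learning rate of 1e-4, and the random tensors standing in for X-ray batches are all assumptions made for the example. In practice the model would be a real ImageNet-pretrained network with its weights loaded from a checkpoint.

```python
import torch
from torch import nn


# Hypothetical stand-in for an ImageNet-pretrained CNN. Its weights are
# randomly initialized here; a real workflow would load pretrained weights.
class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        # "Early" block: plays the role of generic edge/texture detectors.
        self.early = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # "Late" block: plays the role of higher-level, task-specific features.
        self.late = nn.Sequential(
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(16, num_classes)

    def forward(self, x):
        return self.head(self.late(self.early(x)))


model = TinyCNN()

# 1. Freeze the early layers so their weights are not updated.
for p in model.early.parameters():
    p.requires_grad = False

# 2. Replace the 1000-class head with a new 2-class head.
model.head = nn.Linear(16, 2)

# 3. Optimize only the unfrozen parameters, with a low learning rate.
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)

# One illustrative training step on random data (placeholder for X-rays).
images = torch.randn(4, 3, 64, 64)
labels = torch.tensor([0, 1, 0, 1])
early_before = model.early[0].weight.clone()

loss = nn.functional.cross_entropy(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

After the step, the early convolution's weights are unchanged, while the later block and the new head have been updated. Option B corresponds to also freezing `model.late`, and option A to leaving everything trainable with a much larger `lr`.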
Question 2 Multiple Choice
Transfer learning from an ImageNet-pretrained CNN to a satellite imagery task is expected to be less effective than transfer to a natural-photo task. The best explanation is that:
A. ImageNet models have too many parameters to be useful for any other task
B. Satellite images have different pixel value distributions, which confuses the pretrained softmax classifier
C. The later layers of an ImageNet model encode features (dog faces, bird shapes) that are irrelevant to overhead views, requiring more extensive fine-tuning
D. Transfer learning only works when source and target tasks share the same number of classes
Answer: C
Transfer learning effectiveness degrades as domain distance increases. Early layers (edge detectors, texture patterns) still transfer from ImageNet to satellite imagery, but later layers encode high-level features tuned to ground-level natural images, which are far less useful for classifying overhead views of fields, buildings, or roads. More layers therefore need fine-tuning, which in turn requires more target data. The number of output classes (option D) is irrelevant, because the final classification layer is typically replaced for a new task.
Question 3 True / False
Transfer learning is primarily useful when the target task has the same output classes as the source task.
T. True
F. False
Answer: False
The final classification layer is typically replaced for a new task. The value of transfer learning lies in reusing the intermediate feature representations, not the class labels. A model trained to classify 1,000 ImageNet categories can be adapted to a 2-class medical diagnosis task by replacing the last layer. The pretrained feature hierarchy (edges → textures → shapes → high-level patterns) is what transfers, independent of the original class set.
Question 4 True / False
Early convolutional layers of a network trained on ImageNet learn generic features like edge detectors and color gradients that are broadly useful across visual tasks.
T. True
F. False
Answer: True
This has been verified empirically by visualizing what different layers in trained CNNs respond to. Early layers develop Gabor-filter-like edge detectors and color blobs that appear in any image task. This generic quality is precisely why they transfer so well — whether the downstream task involves medical scans, satellite imagery, or product photos, these low-level features remain relevant. Later layers become increasingly task-specific and transfer less reliably.
Question 5 Short Answer
Why does transfer learning from a large source task typically outperform training from scratch on a small target dataset, and what determines how many layers should be frozen versus fine-tuned?
Model answer: Transfer learning works because deep networks learn a hierarchical feature vocabulary: early layers capture generic, reusable primitives (edges, textures), while later layers encode task-specific combinations. Starting from a pretrained network provides good feature initializations, which reduces the overfitting that would occur when fitting millions of parameters to a small dataset from scratch. How many layers to freeze depends on domain similarity and target data size. When domains are similar and data are scarce, freeze more layers, since their features already apply. When domains are distant or data are abundant, unfreeze more layers to allow deeper adaptation.
The underlying principle is that training from scratch on small data tends to overfit severely: the model memorizes the training examples rather than learning generalizable features. Pretrained features provide a strong prior that constrains the hypothesis space. The freeze/fine-tune decision balances underfitting risk (too much frozen, so later layers cannot adapt) against overfitting risk (too much unfrozen, so too few data points guide too many parameters). A low learning rate during fine-tuning gently shifts useful pretrained weights toward the target domain rather than destroying them.
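To make the "few data points to guide many parameters" trade-off concrete, the short calculation below counts trainable parameters for different freeze points. The layer list is a hypothetical four-convolution network with 3x3 kernels and a 2-class head, chosen only for illustration. The 400 labeled examples come from the chest X-ray scenario in Question 1.

```python
# Hypothetical layers, ordered from input to output, with parameter counts
# (weights + biases) for 3x3 convolutions and a 512 -> 2 linear head.
layers = [
    ("conv1", 3 * 64 * 9 + 64),       # 3 -> 64 channels
    ("conv2", 64 * 128 * 9 + 128),    # 64 -> 128 channels
    ("conv3", 128 * 256 * 9 + 256),   # 128 -> 256 channels
    ("conv4", 256 * 512 * 9 + 512),   # 256 -> 512 channels
    ("head", 512 * 2 + 2),            # new 2-class classification head
]


def trainable_params(layers, n_frozen):
    """Sum the parameters of every layer after the first n_frozen layers."""
    return sum(count for _, count in layers[n_frozen:])


n_samples = 400  # labeled examples in the target dataset

for n_frozen in range(len(layers)):
    n_train = trainable_params(layers, n_frozen)
    print(
        f"freeze first {n_frozen} layers: {n_train:>9,} trainable parameters, "
        f"{n_train / n_samples:,.1f} per labeled example"
    )
```

Freezing more layers shrinks the number of parameters the 400 examples must constrain, which lowers overfitting risk but leaves fewer layers free to adapt to the target domain.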