Question 1 Multiple Choice
You have a small dataset of 500 medical X-ray images and want to fine-tune an ImageNet-pretrained ResNet. Which strategy is most appropriate?
A. Full fine-tuning with a standard learning rate, since the large model capacity is needed for medical images
B. Feature extraction (freeze all pretrained layers, train only the new head) to avoid overfitting with limited data
C. Train from scratch with random initialization to ensure the model learns medical-specific features
D. Use discriminative learning rates with the highest rate on early layers, since medical features differ most there
Answer: B
With only 500 images, the main risk is overfitting. Feature extraction freezes the pretrained weights, which already encode general-purpose features such as edges and shapes, and trains only the small classification head. This keeps the number of trainable parameters small and limits overfitting. Full fine-tuning with a standard learning rate risks both catastrophic forgetting and overfitting on 500 examples. Training from scratch would require far more data. Option D applies discriminative learning rates backwards: early layers are the most universal and need the smallest rates, not the largest.
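To put a rough number on how few parameters feature extraction trains, here is a back-of-the-envelope sketch. It assumes a 2048-dimensional feature vector feeding the head (the width of ResNet-50's pooled features; the question does not say which ResNet is used) and a hypothetical two-class task. The function name is illustrative, not part of any library.

```python
def linear_head_params(in_features: int, num_classes: int) -> int:
    """Trainable parameters in one fully connected head: weights plus biases."""
    return in_features * num_classes + num_classes


# Assumptions (not from the question): 2048 input features, 2 output classes.
head = linear_head_params(in_features=2048, num_classes=2)
examples = 500  # dataset size stated in the question

print(f"trainable head parameters: {head}")
print(f"parameters per training example: {head / examples:.1f}")
```

Under these assumptions only a few thousand parameters are fit to the 500 images, compared with the millions that would be updated if the backbone were unfrozen as well.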
Question 2 Multiple Choice
In discriminative (layer-wise) fine-tuning, early layers receive smaller learning rates than later layers. Why?
A. Early layers have more parameters and need smaller updates for numerical stability
B. Early layers learn universal features (edges, textures) that transfer well and need minimal adjustment
C. Early layers are closer to the output and thus more sensitive to gradient updates
D. Earlier layers converge faster, so they need smaller learning rates to prevent overshooting
Answer: B
Early layers in deep networks learn low-level features (edge detectors, color patterns, textures) that are nearly universal across image tasks. These features are already well adapted and should change minimally. Later layers encode more task-specific representations that genuinely need to adapt to the new domain. Small learning rates for early layers preserve the valuable pretrained representations while still letting later layers adjust. Early layers are also the furthest from the loss, so in some deep architectures their gradients are already attenuated; the small learning rate reinforces this stability.
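One simple way to assign such rates is a geometric schedule over layer groups. The sketch below is only an illustration of that idea: the function name, the number of groups, the top rate, and the decay factor are all assumptions, not values from the quiz.

```python
def layerwise_learning_rates(num_groups: int, top_lr: float, decay: float) -> list[float]:
    """Return one learning rate per layer group, ordered from earliest to latest.

    The last group (closest to the output) gets top_lr; each earlier group
    gets the following group's rate multiplied by decay (0 < decay < 1).
    """
    if num_groups < 1:
        raise ValueError("num_groups must be at least 1")
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must be between 0 and 1 (exclusive)")
    return [top_lr * decay ** (num_groups - 1 - i) for i in range(num_groups)]


# Hypothetical values: 4 layer groups, 1e-3 for the last group, 10x decay per group.
rates = layerwise_learning_rates(num_groups=4, top_lr=1e-3, decay=0.1)
for index, rate in enumerate(rates):
    print(f"group {index} (0 = earliest): lr = {rate:.0e}")
```

In a framework such as PyTorch, a list like this would typically be applied by giving the optimizer one parameter group per layer group, each with its own `lr`.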
Question 3 True / False
Fine-tuning a pretrained model with a learning rate close to the original pretraining rate is a safe starting point because the pretrained weights provide a good initialization.
T. True
F. False
Answer: False
This is a dangerous misconception. A pretraining-scale learning rate during fine-tuning can cause catastrophic forgetting: the useful pretrained features are overwritten before the network can adapt them coherently to the new task. A common rule of thumb is to start with a learning rate 10× to 100× smaller than the one used for pretraining. The pretrained initialization is valuable precisely because it encodes well-learned features; a high learning rate destroys that value by making large, uncoordinated updates to all weights at once.
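The 10× to 100× guideline is simple arithmetic, sketched here as a small helper. The function name and the example pretraining rate of 0.1 are hypothetical; use the rate your model was actually pretrained with, and treat the result as a starting range to tune, not a fixed answer.

```python
def fine_tuning_lr_range(pretraining_lr: float) -> tuple[float, float]:
    """Return (lowest, highest) fine-tuning rates under the 10x-100x rule of thumb."""
    if pretraining_lr <= 0:
        raise ValueError("pretraining_lr must be positive")
    return pretraining_lr / 100, pretraining_lr / 10


# Hypothetical pretraining rate; substitute the value used for your model.
low, high = fine_tuning_lr_range(pretraining_lr=0.1)
print(f"try fine-tuning rates between {low:g} and {high:g}")
```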
Question 4 True / False
Feature extraction (freezing all pretrained layers and training only the new head) performs worse than full fine-tuning when the target task is very different from the pretraining domain.
T. True
F. False
Answer: True
When the source and target domains differ significantly (e.g., natural photos versus satellite imagery or medical scans), the pretrained features, particularly beyond the earliest layers, may not transfer well. Feature extraction assumes the frozen representations are useful for the new task. If they are not, no amount of training on the new head can compensate, because the inputs to the head remain poorly suited. Full fine-tuning with a low learning rate lets the network adapt its representations to the new domain, often substantially improving performance despite the higher risk of overfitting.
Question 5 Short Answer
Why is a much lower learning rate used when fine-tuning a pretrained model compared to training from scratch, and what specific failure mode does it prevent?
Model answer: A low learning rate prevents catastrophic forgetting — the phenomenon where fine-tuning with large weight updates destroys the useful feature representations learned during pretraining. With a standard learning rate, the pretrained weights are overwritten rapidly, effectively erasing the benefit of pretraining. A smaller learning rate (typically 10–100× lower) allows weights to drift gently toward task-specific solutions while preserving the pretrained structure. Training from scratch uses a higher rate because there are no useful weights to preserve.
The key insight is that the pretrained weights encode a rich representation built from vast amounts of data. Fine-tuning should refine these representations, not replace them. Catastrophic forgetting is well documented: if you train a language model on task A and then fine-tune it on task B with a large learning rate, performance on task A can degrade sharply. The same mechanism applies to vision models. The low learning rate is not primarily about convergence speed; it is about preserving the signal already embedded in the weights.