A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Fine-Tuning Pretrained Models

Graduate · Depth 102 in the knowledge graph

782 prerequisites beneath it


Core Idea

Fine-tuning adapts a pretrained model to a new task by continuing training on task-specific data, often with a lower learning rate to avoid catastrophically forgetting learned features. The number of layers to fine-tune trades off adaptation (more layers) against regularization (fewer layers); layer-wise learning rates (lower for early layers) help keep training stable.

How It's Best Learned

Compare three fine-tuning strategies on the same task: a frozen base with only the new head trained, all layers unfrozen with a low learning rate, and layer-wise learning rates. For each, measure final accuracy and computational cost.

Explainer

From transfer learning, you know that a neural network trained on a large dataset learns features that are useful far beyond its original task. The early layers of an image classifier trained on ImageNet learn edge detectors, texture recognizers, and color patterns; the middle layers learn parts and shapes; the later layers learn task-specific compositions. Fine-tuning is the process of taking such a pretrained model and adapting it to your specific task — say, classifying medical images or identifying bird species — by continuing training on your (typically smaller) dataset.

The simplest approach is feature extraction: freeze all the pretrained layers, replace the final classification head with a new one matching your number of classes, and train only that new head. This treats the pretrained network as a fixed feature extractor. It works well when your task is similar to the original and your dataset is small: only the head's parameters are optimized, so the model has little capacity to overfit. But if your task differs significantly from the pretraining domain (e.g., medical X-rays versus natural photos), the frozen features may transfer poorly, and you need to let some of the pretrained layers adapt as well.
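Feature extraction can be sketched in a few lines. The example below assumes PyTorch and uses a small, randomly initialized network as a hypothetical stand-in for a pretrained backbone. The layer sizes, class count, learning rate, and random data are all illustrative choices, not values from any particular model.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for a pretrained backbone; in practice this would be
# a network loaded with pretrained weights.
backbone = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 8), nn.ReLU(),
)

# Freeze every backbone parameter so that no gradients are computed for it.
for p in backbone.parameters():
    p.requires_grad = False

num_classes = 3                    # assumed number of classes in the new task
head = nn.Linear(8, num_classes)   # new, randomly initialized head
model = nn.Sequential(backbone, head)

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.SGD(head.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Random placeholder data, used only to show one training step.
x = torch.randn(20, 16)
y = torch.randint(0, num_classes, (20,))

backbone_before = [p.clone() for p in backbone.parameters()]
head_before = [p.clone() for p in head.parameters()]

backbone.eval()  # only matters if the backbone has dropout or batch norm
loss = loss_fn(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

backbone_unchanged = all(
    torch.equal(a, b) for a, b in zip(backbone_before, backbone.parameters())
)
head_changed = any(
    not torch.equal(a, b) for a, b in zip(head_before, head.parameters())
)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(backbone_unchanged, head_changed, trainable)
```

After the step, the backbone weights are identical to their starting values and only the head has moved. The count of trainable parameters covers the head alone, which is why this setup has so little room to overfit.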

Full fine-tuning unfreezes all layers and trains the entire network on your data, but this requires care. The key risk is catastrophic forgetting: if you train with a learning rate suited to training from scratch, the large updates overwrite the useful pretrained features before the network can adapt them to the new task. The usual remedy is a much lower learning rate than you would use for training from scratch, often 10× to 100× smaller. This lets the weights drift gently toward task-specific solutions without destroying the pretrained representations. Think of it as nudging the network rather than retraining it.
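A toy calculation makes the "nudging" intuition concrete. The sketch below is not a neural network: it runs plain gradient descent on a single weight with a quadratic loss and reports how far the weight ends up from its starting value. The starting weight, target, step count, and learning rates are all hypothetical, and distance from the starting weight is used only as a rough proxy for how much pretrained knowledge is disturbed.

```python
def fine_tune(w_pretrained, target, lr, steps):
    """Run plain gradient descent on the loss (w - target)**2."""
    w = w_pretrained
    for _ in range(steps):
        grad = 2.0 * (w - target)  # derivative of (w - target)**2
        w -= lr * grad
    return w


w0 = 1.0       # hypothetical pretrained weight
target = 3.0   # hypothetical optimum for the new task
steps = 5      # the same small training budget for every learning rate

for lr in (0.1, 0.01, 0.001):
    w = fine_tune(w0, target, lr, steps)
    print(f"lr={lr}: weight moved {abs(w - w0):.4f} from its pretrained value")
```

With the same number of steps, each tenfold reduction in learning rate shrinks the drift by roughly a factor of ten. This is the sense in which a low learning rate keeps the model close to its pretrained state.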

A more refined strategy uses discriminative (layer-wise) learning rates, where early layers get the smallest learning rate and later layers get progressively larger ones. The rationale is that early features (edges, textures) are nearly universal and need minimal adjustment, while later features are more task-specific and need more adaptation. A common recipe is to set the last layer group's learning rate to some base value and divide it by a factor of 2 to 3 for each preceding layer group. Combined with gradual unfreezing (training only the head at first, then unfreezing one layer group at a time), this approach often performs well even with small datasets. The number of layers to fine-tune becomes a regularization knob: fewer unfrozen layers means less capacity to adapt but also less risk of overfitting, so you tune this balance based on dataset size and domain similarity.
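The learning-rate recipe and the unfreezing order can be written down as two small helpers. This is a minimal sketch in plain Python: the function names, the four group names, the base rate of 1e-3, and the decay factor of 2 are illustrative assumptions, not part of any library.

```python
def layerwise_lrs(base_lr, num_groups, decay):
    """Return one learning rate per layer group, earliest group first.

    The last group gets base_lr; each preceding group's rate is divided
    by decay once more.
    """
    return [base_lr / decay ** (num_groups - 1 - i) for i in range(num_groups)]


def unfreezing_schedule(num_groups):
    """Yield, stage by stage, the indices of the trainable groups.

    Stage 0 trains only the last group (the head); each later stage
    unfreezes one more group, moving toward the input.
    """
    for stage in range(num_groups):
        yield list(range(num_groups - 1 - stage, num_groups))


groups = ["early", "middle", "late", "head"]  # hypothetical layer groups
lrs = layerwise_lrs(base_lr=1e-3, num_groups=len(groups), decay=2.0)

for stage, trainable in enumerate(unfreezing_schedule(len(groups))):
    active = {groups[i]: lrs[i] for i in trainable}
    print(f"stage {stage}: {active}")
```

In a framework such as PyTorch, rates like these would typically be supplied as per-parameter-group `lr` options to the optimizer. The groups not listed for a stage would be kept frozen.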


Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Introduction to Exponents → Order of Operations → Integer Order of Operations → Variable Expressions → The Distributive Property → Variables and Expressions Review → Introduction to Polynomials → Adding and Subtracting Polynomials → Multiplying Polynomials → Factorial → Permutations → Combinations → Counting Principles: Addition and Multiplication Rules → Introduction to Graph Theory → Propositional Logic Foundations → Logical Equivalences → 
Boolean Algebra → Boolean Type and Truth Values → Comparison Operators and Boolean Tests → Logical Operators and Boolean Algebra → Conditional Statements → Defining and Calling Functions → Functions: Decomposing Problems → Function Parameters and Argument Passing → Return Values → Variable Scope → Introduction to Classes → Objects and Instances → Methods and Attributes → Algorithm Design Basics → Tree Structure and Node Properties → Binary Trees → Tree Traversals → Depth-First Search (DFS) → Depth-First Search: Implementation and Applications → Topological Sort → Dynamic Programming → Longest Common Subsequence (LCS) Problem → Edit Distance: Levenshtein Distance and DP → 0/1 Knapsack Problem: Bounded Capacity DP → Greedy Algorithms → Activity Selection Problem Using Greedy Algorithms → Dijkstra's Algorithm → A* Search Algorithm → Heuristic Search Functions → Local Search Optimization → Genetic Algorithms → Stochastic Gradient Descent and Variants → Optimization Algorithms: SGD, Adam, RMSprop → Hyperparameter Optimization → Fine-Tuning Pretrained Models

Longest path: 103 steps · 782 total prerequisite topics

Prerequisites (4)

Transfer Learning in Neural Networks (hard)
Backpropagation Algorithm (hard)
Hyperparameter Optimization (soft)
Gradient Descent and Optimization (soft)

Leads To (0)

No topics depend on this one yet.