A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Feature Scaling and Normalization

Graduate · Depth 101 in the knowledge graph

771 prerequisites beneath it

Core Idea

Feature scaling transforms features to comparable ranges (standardization: zero mean and unit variance; min-max normalization: [0, 1] range). Distance-based algorithms (KNN, SVM) and gradient-based methods (neural networks) are sensitive to feature scale. Unscaled or inconsistently scaled features can cause slow convergence and numerical instability.

How It's Best Learned

Fit scalers on training data only, then apply consistently to test data. Compare model performance with and without scaling across different algorithms.

Common Misconceptions

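Scaling is sometimes confused with one-hot encoding, but the two are different: scaling rescales numeric features, while one-hot encoding turns categorical values into binary indicator columns. A second common mistake is fitting the scaler on the test set (or on the full dataset) instead of the training set alone, which introduces data leakage.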

Explainer

From your work on feature engineering, you know that the raw features in a dataset can vary wildly in their numeric ranges. A dataset might include age (0–100), income (20,000–500,000), and a binary indicator (0 or 1). Most machine learning algorithms treat these numbers at face value, and when one feature's range is thousands of times larger than another's, it can dominate the model's behavior in unintended ways. Feature scaling transforms all features to comparable ranges so that no single feature overwhelms the others simply because of its units or magnitude.

The two most common techniques are standardization and min-max normalization. Standardization (also called z-score normalization, which you have seen in statistics) subtracts the mean and divides by the standard deviation, producing features with zero mean and unit variance: z = (x − mean) / std. Min-max normalization rescales each feature to a fixed range, typically [0, 1], by subtracting the minimum and dividing by the range: x' = (x − min) / (max − min). Standardization is often preferred when the data contains outliers, because it does not bound the output to a fixed range: an outlier becomes a large z-score instead of pinning the top of the range and compressing all other values into a tiny slice of [0, 1]. The benefit is only partial, however, since the mean and standard deviation are themselves pulled by extreme values. Min-max normalization is useful when you need bounded values, such as for neural network inputs that expect values in [0, 1].
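The two formulas can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `standardize` and `min_max` and the income values are made up for the example, and the population standard deviation is assumed.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Rescale to [0, 1]: subtract the minimum, divide by the range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical incomes; the last value is a deliberate outlier.
incomes = [20_000, 30_000, 40_000, 50_000, 500_000]

z_scores = standardize(incomes)  # zero mean, unit variance, unbounded
bounded = min_max(incomes)       # every value lies in [0, 1]

print([round(z, 3) for z in z_scores])
print([round(b, 4) for b in bounded])
```

With these values, min-max scaling places the four ordinary incomes at or below 0.0625 while the outlier sits at 1.0. The z-scores are not bounded, but the ordinary incomes still end up close together because the outlier inflates the standard deviation.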

Why does scaling matter? Distance-based algorithms like k-nearest neighbors and support vector machines compute distances between data points. If income ranges from 20,000 to 500,000 and age ranges from 0 to 100, the distance calculation is almost entirely determined by income — a difference of 10,000 in income swamps a difference of 50 in age, even though both might be equally important. Scaling puts both features on an equal footing. Gradient-based methods like neural networks and logistic regression are also sensitive: features with large magnitudes produce large gradients, causing the optimization to oscillate along those dimensions while creeping along others. Scaling creates a smoother, more symmetric loss surface that gradient descent can navigate efficiently.
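To make the distance argument concrete, here is a small sketch using the age and income ranges from the example above. The three data points and the helper names (`euclidean`, `min_max_scale`) are hypothetical and chosen only to show how the nearest neighbor can change once both features are rescaled to [0, 1].

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def min_max_scale(point, lows, highs):
    """Rescale each coordinate to [0, 1] using a known min and max per feature."""
    return [(v - lo) / (hi - lo) for v, lo, hi in zip(point, lows, highs)]


# Features are (age, income); ranges follow the text: age 0-100, income 20,000-500,000.
lows, highs = [0, 20_000], [100, 500_000]

query = [30, 50_000]
close_in_age = [32, 60_000]     # 2 years apart, 10,000 apart in income
close_in_income = [80, 51_000]  # 50 years apart, 1,000 apart in income

# Raw distances: income dominates, so the 50-year age gap barely registers.
raw_age = euclidean(query, close_in_age)
raw_income = euclidean(query, close_in_income)

# Scaled distances: both features contribute on comparable terms.
q, a, i = (min_max_scale(p, lows, highs) for p in (query, close_in_age, close_in_income))
scaled_age = euclidean(q, a)
scaled_income = euclidean(q, i)

print(raw_age, raw_income)
print(scaled_age, scaled_income)
```

On the raw features, the point that is 50 years older counts as the nearer neighbor; after scaling, the point that is close in age is nearer. Which answer is right depends on the problem, but scaling removes the accident of units from the decision.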

A critical practical rule is that scaling parameters must be computed from the training set only and then applied identically to the test set. If you compute the mean and standard deviation using all available data (including the test set), you leak information from the test set into the training process — this is data leakage, and it produces overly optimistic performance estimates that do not hold on truly unseen data. In practice, this means fitting a scaler object on the training data and using its `transform` method on both training and test data, never calling `fit` on the test set. This discipline extends to cross-validation: scaling must happen inside each fold, not before the split.
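The train-only rule can be sketched with a tiny hand-written scaler. `StandardScalerSketch` is a hypothetical stand-in written for this example; it only mirrors the `fit` / `transform` pattern described above, and the numbers are invented.

```python
from statistics import mean, pstdev


class StandardScalerSketch:
    """Hypothetical minimal scaler that mirrors the fit/transform pattern."""

    def fit(self, values):
        # Learn the scaling parameters from the data passed in.
        self.mean_ = mean(values)
        self.std_ = pstdev(values)
        return self

    def transform(self, values):
        # Apply the stored parameters; nothing is re-estimated here.
        return [(v - self.mean_) / self.std_ for v in values]


train = [20.0, 30.0, 40.0, 50.0]
test = [60.0, 70.0]

# Correct: parameters come from the training data only.
scaler = StandardScalerSketch().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)  # reuses the training mean and std

# Leaky, for contrast: the test values shift the learned mean and std.
leaky = StandardScalerSketch().fit(train + test)

print(scaler.mean_, leaky.mean_)
```

The same pattern applies inside cross-validation: for each fold, fit the scaler on that fold's training portion and transform the held-out portion with those parameters. Libraries such as scikit-learn expose a similar `fit` / `transform` interface.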

Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Introduction to Exponents → Order of Operations → Integer Order of Operations → Variable Expressions → The Distributive Property → Variables and Expressions Review → Introduction to Polynomials → Adding and Subtracting Polynomials → Multiplying Polynomials → Factorial → Permutations → Combinations → Counting Principles: Addition and Multiplication Rules → Introduction to Graph Theory → Propositional Logic Foundations → Logical Equivalences → 
Boolean Algebra → Boolean Type and Truth Values → Comparison Operators and Boolean Tests → Logical Operators and Boolean Algebra → Conditional Statements → Defining and Calling Functions → Functions: Decomposing Problems → Function Parameters and Argument Passing → Return Values → Variable Scope → Introduction to Classes → Objects and Instances → Methods and Attributes → Algorithm Design Basics → Tree Structure and Node Properties → Binary Trees → Tree Traversals → Depth-First Search (DFS) → Depth-First Search: Implementation and Applications → Topological Sort → Dynamic Programming → Longest Common Subsequence (LCS) Problem → Edit Distance: Levenshtein Distance and DP → 0/1 Knapsack Problem: Bounded Capacity DP → Greedy Algorithms → Activity Selection Problem Using Greedy Algorithms → Dijkstra's Algorithm → A* Search Algorithm → Heuristic Search Functions → Local Search Optimization → Genetic Algorithms → Stochastic Gradient Descent and Variants → Batch Normalization → Feature Scaling and Normalization

Longest path: 102 steps · 771 total prerequisite topics

Prerequisites (4)

Feature Engineering and Selection (hard) · Mean, Median, and Mode (soft) · Standard Normal Distribution and Z-Score Standardization (soft) · Batch Normalization (soft)

Leads To (0)

No topics depend on this one yet.