A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Least Squares Approximation and Normal Equations

College · Depth 84 in the knowledge graph

114 topics build on this

356 prerequisites beneath it

Core Idea

For an inconsistent system Ax = b, the least squares solution minimizes ||Ax − b||². The solution satisfies A^T Ax = A^T b (the normal equations), giving x̂ = (A^T A)^-1 A^T b when A has full column rank. Least squares finds the best approximation when exact solutions don't exist, essential in statistics and data fitting.

Explainer

Many real-world systems are overdetermined: they have more equations than unknowns, and typically no single solution satisfies all of them simultaneously. Think of fitting a line to 100 data points: the line can't pass exactly through every point, so you want the line that comes as close as possible to all of them. This is exactly what least squares does. When Ax = b has no solution, least squares asks: what vector x̂ makes Ax̂ as close to b as possible, measured by the Euclidean distance ||Ax̂ − b||?

The answer has a beautiful geometric interpretation rooted in the orthogonal projection and Gram-Schmidt work you've already done. The matrix A's columns span a subspace (the column space of A). The vector b may not lie in that subspace, which is precisely why the system is inconsistent. The best approximation Ax̂ is the orthogonal projection of b onto the column space of A, so the residual vector b − Ax̂ must be perpendicular to every column of A. Writing this orthogonality condition as A^T(b − Ax̂) = 0 and expanding immediately yields the normal equations: A^T Ax̂ = A^T b. This is a square system (n × n when A has n columns) that always has a solution, even when the original system did not.
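The orthogonality argument above can be checked numerically. The following is a minimal sketch using NumPy; the 3 × 2 matrix A and vector b are hypothetical values chosen for illustration, not taken from this page.

```python
import numpy as np

# Hypothetical inconsistent system: 3 equations, 2 unknowns.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

# Solve the normal equations (A^T A) x_hat = A^T b.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# p is the projection of b onto the column space of A; r is the residual.
p = A @ x_hat
r = b - p

# The residual should be perpendicular to every column of A,
# so A^T r should be (numerically) the zero vector.
orthogonality = A.T @ r

print("x_hat =", x_hat)
print("A^T r =", orthogonality)
```

For this data the normal equations read 3x + 3y = 6 and 3x + 5y = 0, so x̂ = (5, −3), and the residual (1, −2, 1) is orthogonal to both columns of A.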

When A has full column rank (its columns are linearly independent), A^T A is invertible and the unique solution is x̂ = (A^T A)^-1 A^T b. In this case the matrix (A^T A)^-1 A^T is the (Moore-Penrose) pseudoinverse of A. In statistics, this formula underlies ordinary least squares regression: if you set up the matrix A with a column of ones and a column of predictor values, the least squares solution gives you the intercept and slope of the best-fit line. The geometry of projecting b onto the column space makes clear why this works and what "best" means precisely.
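As an illustration of the line-fitting setup, here is a small sketch that builds A from a column of ones and a column of predictor values. The four data points are hypothetical. It computes the coefficients three ways, which should agree when A has full column rank: the explicit formula, NumPy's pseudoinverse `np.linalg.pinv`, and the library solver `np.linalg.lstsq`.

```python
import numpy as np

# Hypothetical data: predictor values t and observations y.
t = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 4.0, 8.0])

# Design matrix: a column of ones (intercept) and a column of predictors (slope).
A = np.column_stack([np.ones_like(t), t])

# 1) Explicit formula x_hat = (A^T A)^-1 A^T y (fine for a tiny example).
coef_formula = np.linalg.inv(A.T @ A) @ A.T @ y

# 2) Pseudoinverse of A applied to y.
coef_pinv = np.linalg.pinv(A) @ y

# 3) Library least squares solver; the first return value is the solution.
coef_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

intercept, slope = coef_lstsq
print(f"best-fit line: y = {intercept:.3f} + {slope:.3f} t")
```

For these points the fitted intercept is 0.7 and the slope is 2.2. In practice a solver such as `np.linalg.lstsq` is usually preferred over forming the inverse explicitly.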

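When A does not have full column rank (its columns are linearly dependent), the normal equations still have solutions, but the solution is no longer unique. In a regression model this typically signals a redundant predictor. When A does have full column rank, the Gram-Schmidt process you studied provides an alternative to inverting A^T A: QR decomposition factors A = QR, where Q has orthonormal columns and R is upper triangular and invertible. Since Q^T Q = I, we get A^T A = R^T R, and the normal equations simplify to Rx̂ = Q^T b, which is easy to solve by back-substitution. This is numerically preferable to forming A^T A directly, since forming A^T A squares the condition number (κ(A^T A) = κ(A)² in the 2-norm) and amplifies floating-point errors. In the rank-deficient case R is singular, so plain back-substitution breaks down, and variants such as column-pivoted QR or the singular value decomposition are typically used instead.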
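The QR route can be sketched as follows, assuming A has full column rank so that R is invertible. The matrix and right-hand side are hypothetical, and `back_substitute` is an illustrative helper written for this sketch rather than a library function. The last lines compute the 2-norm condition numbers of A and A^T A so they can be compared.

```python
import numpy as np

def back_substitute(R, c):
    """Solve R x = c for upper-triangular, invertible R (illustrative helper)."""
    n = R.shape[0]
    x = np.zeros(n)
    # Work from the last row upward, using already-computed entries of x.
    for i in range(n - 1, -1, -1):
        x[i] = (c[i] - np.dot(R[i, i + 1:], x[i + 1:])) / R[i, i]
    return x

# Hypothetical full-column-rank system (3 equations, 2 unknowns).
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

# Reduced QR: Q is 3x2 with orthonormal columns, R is 2x2 upper triangular.
Q, R = np.linalg.qr(A)

# Solve R x_hat = Q^T b by back-substitution; A^T A is never formed.
x_qr = back_substitute(R, Q.T @ b)

# Same answer as the normal equations, for comparison.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# 2-norm condition numbers: cond(A^T A) equals cond(A) squared.
cond_A = np.linalg.cond(A)
cond_AtA = np.linalg.cond(A.T @ A)
print("x_qr =", x_qr)
print("cond(A) =", cond_A, " cond(A^T A) =", cond_AtA)
```

On a small, well-conditioned example like this both routes give the same x̂; the difference matters when the columns of A are nearly dependent and cond(A) is large.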

Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Dividing Integers → Unit Rates → Proportions → Percent Concept → Converting Between Fractions, Decimals, and Percents → Operations with Rational Numbers → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → 
Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Sine, Cosine, and Tangent Ratios → Trigonometric Ratios Review → Vectors in Two Dimensions → Vector Operations: Addition, Subtraction, and Scalar Multiplication → Dot Product (Inner Product in R^n) → Matrix Multiplication → Determinants of 2×2 and 3×3 Matrices → Invertible Matrices and Matrix Inverses → Systems of Linear Equations and Matrix Form → Gaussian Elimination and Row Reduction → Null Space and Kernel → Kernel and Image of Linear Transformations → Least Squares Approximation and Normal Equations

Longest path: 85 steps · 356 total prerequisite topics

Prerequisites (6)

Gram-Schmidt Orthogonalization Process (hard)
Systems of Linear Equations and Matrix Form (hard)
Column Space and Row Space (hard)
Orthogonal Projections (hard)
Gram-Schmidt Process and QR Decomposition (soft)
Kernel and Image of Linear Transformations (soft)

Leads To (2)

Least Squares Regression: Fundamentals and Derivation (hard)
Linear Regression in Machine Learning (hard)