
A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Least Squares Estimation

College level, depth 82 in the knowledge graph

217 topics build on this

327 prerequisites beneath it



Core Idea

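Least squares estimation chooses the fitted line that minimizes the sum of squared residuals, Σ(yᵢ - ŷᵢ)². For simple linear regression, this yields slope b₁ = r(s_y/s_x) and intercept b₀ = ȳ - b₁·x̄. The method is intuitive, has a closed-form solution, and coincides with maximum likelihood estimation when the errors are normally distributed.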

How It's Best Learned

Fit linear regression by hand for a small dataset. Visualize residuals and understand what minimizing their squared sum means geometrically. Compare least squares to other fitting methods.
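As one possible hand calculation, take five made-up points: (1, 2), (2, 4), (3, 5), (4, 4), (5, 5). The sketch below uses the deviation form b₁ = S_xy / S_xx, which is algebraically the same as r · (s_y / s_x); the symbols S_xy and S_xx are shorthand introduced here, not notation from the rest of this page.

```latex
\begin{align*}
\bar{x} &= 3, \qquad \bar{y} = 4 \\
S_{xy} &= \sum_i (x_i - \bar{x})(y_i - \bar{y})
        = (-2)(-2) + (-1)(0) + (0)(1) + (1)(0) + (2)(1) = 6 \\
S_{xx} &= \sum_i (x_i - \bar{x})^2 = 4 + 1 + 0 + 1 + 4 = 10 \\
b_1 &= \frac{S_{xy}}{S_{xx}} = \frac{6}{10} = 0.6 \\
b_0 &= \bar{y} - b_1 \bar{x} = 4 - 0.6 \cdot 3 = 2.2 \\
\hat{y}_i &: \; 2.8,\; 3.4,\; 4.0,\; 4.6,\; 5.2 \\
y_i - \hat{y}_i &: \; -0.8,\; 0.6,\; 1.0,\; -0.6,\; -0.2 \\
\sum_i (y_i - \hat{y}_i)^2 &= 0.64 + 0.36 + 1.00 + 0.36 + 0.04 = 2.40
\end{align*}
```

The residuals sum to zero, which is a quick check that the arithmetic is consistent with a least squares fit that includes an intercept.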

Common Misconceptions

Thinking least squares requires normal errors (the fit minimizes the sum of squared residuals regardless of the error distribution; normality matters only for exact inference). Assuming high R² means good predictions. Not recognizing that outliers can heavily influence least squares estimates.

Explainer

From your study of linear regression, you know the goal: given paired data (x₁, y₁), ..., (xₙ, yₙ), find the line ŷ = b₀ + b₁x that best describes the relationship between x and y. But "best" needs a precise definition. Least squares estimation defines "best" as the line that minimizes the sum of squared residuals: Σᵢ(yᵢ − ŷᵢ)² = Σᵢ(yᵢ − b₀ − b₁xᵢ)². Each residual yᵢ − ŷᵢ measures how far the observed value falls from the fitted line, and squaring these residuals produces a smooth, differentiable objective function whose minimum can be found analytically.
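To make the objective concrete, here is a minimal sketch that evaluates the sum of squared residuals for a few candidate lines. The data and the candidate coefficients are made up for illustration, and `sum_squared_residuals` is a hypothetical helper name.

```python
def sum_squared_residuals(xs, ys, b0, b1):
    """Return the sum over i of (y_i - b0 - b1 * x_i) ** 2."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Illustrative data (hypothetical, not from the source)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Candidate lines as (b0, b1) pairs; the last is the least squares line for this data
candidates = {
    "flat line at the mean": (4.0, 0.0),
    "steeper guess": (1.0, 1.0),
    "least squares line": (2.2, 0.6),
}

ssr = {
    name: sum_squared_residuals(xs, ys, b0, b1)
    for name, (b0, b1) in candidates.items()
}
for name, value in ssr.items():
    print(f"{name}: sum of squared residuals = {value:.2f}")
```

Of the three candidates, the least squares line gives the smallest value, which is exactly what "best" means under this definition.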

The minimization is a calculus problem. Taking partial derivatives of Σ(yᵢ − b₀ − b₁xᵢ)² with respect to b₀ and b₁, setting them to zero, and solving the resulting system of two linear equations (the normal equations) yields closed-form solutions: b₁ = r · (s_y / s_x) and b₀ = ȳ − b₁x̄, where r is the sample correlation coefficient, s_y and s_x are the sample standard deviations, and x̄ and ȳ are the sample means. The slope b₁ is proportional to the correlation, so it always has the same sign as r; the factor s_y / s_x converts the unitless correlation into units of y per unit of x. The intercept b₀ ensures the line passes through the point (x̄, ȳ), the center of the data.
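The closed-form solution can be sketched in a few lines using only the Python standard library. The data is made up, and `least_squares_line` is a hypothetical helper name. The last lines check the two normal equations numerically: at the solution, the residuals and the x-weighted residuals should both sum to (approximately) zero.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (b0, b1) using b1 = r * (s_y / s_x) and b0 = y_bar - b1 * x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    # Sample covariance and correlation coefficient r
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    r = cov / (s_x * s_y)
    b1 = r * (s_y / s_x)
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Illustrative data (hypothetical, not from the source)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = least_squares_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Normal equations: both sums vanish at the least squares solution
sum_resid = sum(residuals)
sum_x_resid = sum(x * e for x, e in zip(xs, residuals))

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"sum of residuals = {sum_resid:.2e}")
print(f"sum of x * residual = {sum_x_resid:.2e}")
```

Floating-point rounding means the two sums may print as tiny nonzero numbers rather than exactly zero.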

Why minimize squared residuals rather than, say, absolute residuals? Squaring has three key consequences. First, it makes the objective function differentiable everywhere, enabling the clean calculus-based solution above. Absolute values create a kink at zero, so least absolute deviations has no general closed-form solution and must be solved numerically. Second, squaring penalizes large residuals disproportionately: a residual of 10 contributes 100 to the objective, while a residual of 1 contributes just 1. This means outliers pull the fitted line strongly toward them. Third, under the assumption of independent, normally distributed errors with constant variance, least squares coincides with the maximum likelihood estimate. Without normality, least squares still gives the best linear unbiased estimator (BLUE) by the Gauss-Markov theorem, provided the errors have zero mean, equal variance, and are uncorrelated.
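The pull of a single outlier is easy to see numerically. The sketch below fits the same made-up data twice, once as is and once with the last y value replaced by an extreme one. `fit_line` is a hypothetical helper that uses the deviation form b₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)², which is algebraically equal to r · (s_y / s_x).

```python
def fit_line(xs, ys):
    """Return least squares (b0, b1) for a simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Illustrative data (hypothetical); only the last y value differs
xs = [1, 2, 3, 4, 5]
ys_clean = [2, 4, 5, 4, 5]
ys_outlier = [2, 4, 5, 4, 15]

_, slope_clean = fit_line(xs, ys_clean)
_, slope_outlier = fit_line(xs, ys_outlier)

print(f"slope without outlier: {slope_clean:.2f}")
print(f"slope with outlier:    {slope_outlier:.2f}")
```

Changing one point out of five is enough to change the slope several times over, which is why a residual plot is worth checking before trusting the estimates.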

A common misconception is that least squares requires normally distributed errors. It does not: the formulas for b₀ and b₁ are purely algebraic and minimize the sum of squared residuals regardless of the error distribution. Normality matters for the inferential layer: the exact small-sample distributions behind confidence intervals, t-tests on coefficients, and F-tests for model significance assume normal errors, although these procedures remain approximately valid in large samples under milder conditions. Another pitfall is interpreting R² = 1 − (SS_residual / SS_total) as proof of a good model. A high R² means the model explains a large share of variation in the training data, but it does not by itself guarantee predictive accuracy on new data. Overfitting, extrapolation, and omitted variables can all produce high R² with poor predictions.
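The gap between a high R² and good predictions can be illustrated with a small extrapolation sketch. The training data here is synthetic, generated from y = x² on x = 0, ..., 5 purely for illustration, and `fit_line` and `r_squared` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Return least squares (b0, b1) for a simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

def r_squared(xs, ys, b0, b1):
    """Return R^2 = 1 - SS_residual / SS_total for the line b0 + b1 * x."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Synthetic training data from a curved relationship (assumption: y = x^2)
train_x = [0, 1, 2, 3, 4, 5]
train_y = [x ** 2 for x in train_x]

b0, b1 = fit_line(train_x, train_y)
r2 = r_squared(train_x, train_y, b0, b1)

# Extrapolate well outside the training range
new_x = 10
predicted = b0 + b1 * new_x
actual = new_x ** 2

print(f"training R^2 = {r2:.3f}")
print(f"at x = {new_x}: predicted {predicted:.1f}, actual {actual}")
```

The straight line fits the training range well enough to earn a high R², yet it misses badly once x leaves that range, because the true relationship is curved.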


Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Dividing Integers → Unit Rates → Proportions → Percent Concept → Converting Between Fractions, Decimals, and Percents → Operations with Rational Numbers → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → 
Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Sine, Cosine, and Tangent Ratios → Trigonometric Ratios Review → Vectors in Two Dimensions → Vector Operations: Addition, Subtraction, and Scalar Multiplication → Dot Product (Inner Product in R^n) → Inner Product Spaces → Orthogonality → Orthogonal Projections → Orthogonal Projections and Least Squares Approximation → Linear Regression and Least Squares Estimation → Least Squares Estimation

Longest path: 83 steps · 327 total prerequisite topics

Prerequisites (1)

Linear Regression and Least Squares Estimation (soft)

Leads To (3)

Linear Regression Basics (soft), Maximum Likelihood Estimation (soft), Regression Diagnostics and Residual Analysis (hard)