A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Linear Regression in Machine Learning

Graduate · Depth 90 in the knowledge graph

100 topics build on this

624 prerequisites beneath it

Core Idea

Linear regression models continuous outputs as linear combinations of input features. Least squares minimizes squared error; solutions are computed analytically via normal equations or iteratively via gradient descent. Extensions include regularization and polynomial features.

Explainer

Linear regression is the foundational model of supervised learning: given input features and continuous output labels, find the linear combination of features that best predicts the output. Despite its simplicity, understanding it deeply — why it works, how it's solved, and when it fails — provides the conceptual scaffolding for nearly every more advanced method.

The model is ŷ = Xβ, where X is the feature matrix (n examples × p features, with a column of ones for the intercept), β is the parameter vector to learn, and ŷ is the prediction vector. The goal is to choose β minimizing the sum of squared residuals, ‖y − Xβ‖². Setting the gradient of this objective to zero gives the normal equations, XᵀXβ = Xᵀy. When XᵀX is invertible (equivalently, when the columns of X are linearly independent), the minimum is unique: β = (XᵀX)⁻¹Xᵀy. This is an exact, closed-form solution with no iteration required. The matrix XᵀX is the Gram matrix of the features; when the feature columns are centered it is proportional to their sample covariance, so its inverse can be read as "un-rotating" and "un-stretching" the feature space to align predictions with targets.
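
As a minimal sketch of the closed-form solution described above, the following NumPy snippet fits β on synthetic data. The data, the random seed, and the names true_beta and beta_normal are assumptions made for illustration. It solves the linear system XᵀXβ = Xᵀy with np.linalg.solve rather than forming the inverse explicitly, which is generally the more numerically stable choice, and cross-checks the result against np.linalg.lstsq.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = 2 + 3*x1 - 1*x2 + noise
n = 200
features = rng.normal(size=(n, 2))
true_beta = np.array([2.0, 3.0, -1.0])

# Prepend a column of ones so the first coefficient acts as the intercept.
X = np.column_stack([np.ones(n), features])
y = X @ true_beta + 0.1 * rng.normal(size=n)

# Normal equations: solve (X^T X) beta = X^T y without forming the inverse.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equations:", beta_normal)
print("lstsq:           ", beta_lstsq)
```

Both approaches should agree to within floating-point error on a well-conditioned problem like this one; they can diverge when XᵀX is close to singular.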

For large feature sets, the closed-form solution becomes too expensive: forming XᵀX costs O(n·p²), and inverting the resulting p×p matrix scales as O(p³). Gradient descent is the practical alternative: start with an arbitrary β, compute the gradient of the loss with respect to β, and take a small step in the opposite direction, which reduces the loss when the step size is small enough. Repeat until convergence. Each full-batch step costs O(n·p), which scales far better as the number of features grows into the millions. This is the same mechanism used to train neural networks, which is why understanding gradient descent in the linear regression setting is non-negotiable before moving to deeper models.
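
The loop described above can be sketched in a few lines of NumPy. This is an illustrative full-batch version that minimizes the mean squared error; the function name gradient_descent, the learning rate, the step count, and the synthetic data are all assumptions chosen for the example, not tuned recommendations.

```python
import numpy as np

def gradient_descent(X, y, learning_rate=0.1, n_steps=2000):
    """Minimize (1/n) * ||X @ beta - y||^2 by full-batch gradient descent."""
    n, p = X.shape
    beta = np.zeros(p)  # arbitrary starting point
    for _ in range(n_steps):
        residual = X @ beta - y               # O(n*p)
        grad = (2.0 / n) * (X.T @ residual)   # O(n*p)
        beta -= learning_rate * grad          # step against the gradient
    return beta

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, -2.0, 0.5, 4.0]) + 0.1 * rng.normal(size=n)

beta_gd = gradient_descent(X, y)
beta_exact, *_ = np.linalg.lstsq(X, y, rcond=None)

print("gradient descent:", beta_gd)
print("least squares:   ", beta_exact)
```

On this well-conditioned problem the iterates approach the least-squares solution; with badly scaled or highly correlated features, the same learning rate may converge slowly or diverge.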

A common misconception is that "linear regression can only model straight-line relationships." The linearity refers to the parameters, not the features. Adding x², log(x), or interaction terms to the feature matrix allows linear regression to fit curves and interactions. The model is still "linear" because ŷ = β₀ + β₁x + β₂x² is linear in the parameters (β₀, β₁, β₂), and all OLS theory still applies. This is the basis of polynomial regression and illustrates why feature engineering matters as much as model selection.
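
To illustrate the point about linearity in the parameters, here is a small sketch that fits a quadratic curve with ordinary least squares. The synthetic data and the coefficient values (1, −2, 0.5) are assumptions for the example; the only change from a straight-line fit is the extra x² column in the feature matrix.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic curved data (an assumption): y = 1 - 2x + 0.5x^2 + noise
x = np.linspace(-3.0, 3.0, 100)
y = 1.0 - 2.0 * x + 0.5 * x**2 + 0.05 * rng.normal(size=x.size)

# Feature matrix with columns [1, x, x^2]; the model stays linear in beta.
X_poly = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares on the expanded features.
beta, *_ = np.linalg.lstsq(X_poly, y, rcond=None)

print("estimated (beta0, beta1, beta2):", beta)
```

The solver never sees anything nonlinear: it receives a matrix and a target vector, exactly as in the straight-line case.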

Regularization addresses a practical limitation of OLS: when features are highly correlated (multicollinearity) or when p approaches or exceeds n, XᵀX becomes ill-conditioned or singular, so (XᵀX)⁻¹ is numerically unstable or undefined and the model overfits. Ridge regression adds a penalty λ‖β‖² to the objective, shrinking coefficients toward zero and stabilizing the solution. Lasso adds λ‖β‖₁, which drives some coefficients exactly to zero, performing automatic feature selection. Both are still "linear regression" in the same model class (same structure, different objective function). Ridge retains a closed-form solution, β = (XᵀX + λI)⁻¹Xᵀy, and can also be fit by gradient descent. The lasso penalty is not differentiable at zero, so lasso is typically solved with coordinate descent or proximal gradient methods rather than plain gradient descent.
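
The two penalties can be compared in a short sketch. Ridge is solved here through its closed form, (XᵀX + λI)β = Xᵀy. For lasso, the sketch uses proximal gradient descent (ISTA), a variant of gradient descent in which each step is followed by soft-thresholding; that thresholding is what produces exact zeros. The function names ridge_closed_form, soft_threshold and lasso_ista, the λ values, and the synthetic data are hypothetical choices for illustration, and for simplicity the example has no intercept and penalizes every coefficient.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimize ||y - X @ beta||^2 + lam * ||beta||^2 via a linear solve."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def soft_threshold(z, t):
    """Shrink each entry of z toward zero by t, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_steps=1000):
    """Minimize (1/(2n)) * ||y - X @ beta||^2 + lam * ||beta||_1 (proximal gradient)."""
    n, p = X.shape
    # Step size 1/L, where L is the largest eigenvalue of X^T X / n.
    step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()
    beta = np.zeros(p)
    for _ in range(n_steps):
        grad = X.T @ (X @ beta - y) / n  # gradient of the smooth squared-error part
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Synthetic data (an assumption): only features 0 and 3 matter.
rng = np.random.default_rng(3)
n, p = 200, 5
X = rng.normal(size=(n, p))
true_beta = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = X @ true_beta + 0.1 * rng.normal(size=n)

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
# Note: the two objectives are scaled differently, so the lam values are not comparable.
beta_ridge = ridge_closed_form(X, y, lam=10.0)
beta_lasso = lasso_ista(X, y, lam=0.1)

print("OLS:  ", beta_ols)
print("ridge:", beta_ridge)
print("lasso:", beta_lasso)
```

Ridge shrinks every coefficient but leaves them all nonzero, while lasso sets the coefficients of the irrelevant features to exactly zero at the cost of some shrinkage on the relevant ones.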

Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Dividing Integers → Unit Rates → Proportions → Percent Concept → Converting Between Fractions, Decimals, and Percents → Operations with Rational Numbers → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → 
Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Sine, Cosine, and Tangent Ratios → Trigonometric Ratios Review → Radian Measure → Converting Between Degrees and Radians → The Unit Circle → Graphing Sine and Cosine → Graphing Tangent and Reciprocal Trigonometric Functions → Derivatives of Trigonometric Functions → Antiderivatives → Indefinite Integrals → Basic Integration Rules → Riemann Sums → Definite Integral Definition → Probability Density Functions and Continuous Distributions → Cumulative Distribution Functions → Continuous Random Variables → Probability Density Functions → Expected Value → Linear Regression in Machine Learning

Longest path: 91 steps · 624 total prerequisite topics

Prerequisites (10)

Matrices Introduction (hard)
Expected Value (hard)
Least Squares Approximation and Normal Equations (hard)
Matrix Operations (hard)
Eigenvalues and Eigenvectors (soft)
Linear Transformations (soft)
Linear Systems: Notation and Solution Existence (soft)
Descriptive Statistics Synthesis (soft)
Expected Value and Variance (soft)
Fairness in Machine Learning (soft)

Leads To (4)

Decision Boundaries in Classification (hard)
Logistic Regression for Classification (hard)
Neural Network Fundamentals (hard)
Support Vector Regression (hard)