Linear Regression in Machine Learning

Graduate · Depth 63 in the knowledge graph
Unlocks 90 downstream topics
supervised-learning regression optimization

Core Idea

Linear regression models continuous outputs as linear combinations of input features. Least squares minimizes squared error; solutions are computed analytically via normal equations or iteratively via gradient descent. Extensions include regularization and polynomial features.

Explainer

Linear regression is the foundational model of supervised learning: given input features and continuous output labels, find the linear combination of features that best predicts the output. Despite its simplicity, understanding it deeply — why it works, how it's solved, and when it fails — provides the conceptual scaffolding for nearly every more advanced method.

The model is ŷ = Xβ, where X is the feature matrix (n examples × p features, with a column of ones for the intercept), β is the parameter vector to learn, and ŷ is the prediction vector. The goal is to choose β minimizing the sum of squared residuals, ‖y − Xβ‖². Setting the gradient of this objective to zero gives the normal equations XᵀXβ = Xᵀy. When XᵀX is invertible (that is, when X has full column rank), the minimum is unique and given by β = (XᵀX)⁻¹Xᵀy. This is an exact, closed-form solution: no iteration is required. The matrix XᵀX is the Gram matrix of the features, which is proportional to their covariance matrix when the columns are mean-centered, and its inverse "un-rotates" and "un-stretches" the feature space to align predictions with targets.
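As a minimal sketch of the closed-form solution above, the following NumPy snippet solves the normal equations on a tiny made-up dataset. The function name `fit_ols` and the toy data are assumptions chosen for illustration. The code solves the linear system rather than forming (XᵀX)⁻¹ explicitly, which gives the same β with better numerical behavior.

```python
import numpy as np

def fit_ols(X, y):
    """Return beta solving the normal equations (X^T X) beta = X^T y."""
    # Prepend a column of ones so that beta[0] plays the role of the intercept.
    X1 = np.column_stack([np.ones(len(X)), X])
    # Solve the p x p linear system instead of computing an explicit inverse.
    return np.linalg.solve(X1.T @ X1, X1.T @ y)

# Toy data (invented for this example): y = 1 + 2*x with no noise.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

beta = fit_ols(x.reshape(-1, 1), y)
print(beta)  # intercept close to 1, slope close to 2
```

Because the data are noiseless and exactly linear, the recovered intercept and slope match the generating values up to floating-point error.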

For large feature sets, the closed-form solution becomes too expensive: forming XᵀX costs O(n·p²), and inverting the resulting p×p matrix scales as O(p³). Gradient descent is the practical alternative: start with arbitrary β, compute the gradient of the loss with respect to β (for squared error, proportional to Xᵀ(Xβ − y)), and take a small step in the opposite direction, which reduces the loss. Repeat until convergence. Each step costs O(n·p), scaling gracefully to millions of features. This is the same mechanism used to train neural networks, which is why understanding gradient descent in the linear regression setting is non-negotiable before moving to deeper models.
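The loop described above can be sketched in a few lines of NumPy. This is an illustrative batch gradient descent, not a tuned implementation: the function name `fit_gd`, the learning rate, the step count, and the toy data are all assumptions, and the loss is taken to be the mean squared error scaled by one half.

```python
import numpy as np

def fit_gd(X, y, lr=0.1, n_steps=5000):
    """Batch gradient descent on the loss (1 / 2n) * ||X beta - y||^2."""
    n, p = X.shape
    beta = np.zeros(p)                 # arbitrary starting point
    for _ in range(n_steps):
        residual = X @ beta - y        # O(n * p)
        grad = X.T @ residual / n      # gradient of the loss, O(n * p)
        beta -= lr * grad              # small step against the gradient
    return beta

# Toy data (invented for this example): y = 1 + 2*x with no noise.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])   # column of ones for the intercept
y = 1.0 + 2.0 * x

beta_gd = fit_gd(X, y)
print(beta_gd)  # approaches the least-squares solution, close to [1, 2]
```

The fixed learning rate of 0.1 happens to be stable for this small problem; in general the step size has to be chosen relative to the scale of the features, and a rate that is too large makes the iterates diverge.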

A common misconception is that "linear regression can only model straight-line relationships." The linearity refers to the parameters, not the features. Adding x², log(x), or interaction terms to the feature matrix allows linear regression to fit curves and interactions. The model is still "linear" because ŷ = β₀ + β₁x + β₂x² is linear in the parameters (β₀, β₁, β₂), and all OLS theory still applies. This is the basis of polynomial regression and illustrates why feature engineering matters as much as model selection.
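To make the point concrete, here is a small sketch that fits a quadratic with ordinary least squares simply by adding an x² column to the feature matrix. The data and variable names are invented for illustration; `np.linalg.lstsq` is used as a standard least-squares solver.

```python
import numpy as np

# Toy data (invented for this example): a noiseless quadratic in x.
x = np.linspace(-2.0, 2.0, 9)
y = 1.0 - 3.0 * x + 0.5 * x**2

# Feature matrix with columns [1, x, x^2]. The model is nonlinear in x
# but still linear in the parameters (beta_0, beta_1, beta_2).
X_poly = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares on the expanded features.
beta_poly, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
print(beta_poly)  # close to [1, -3, 0.5]
```

Nothing about the solver changed: only the columns of the feature matrix did.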

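Regularization addresses a practical limitation of OLS: when features are highly correlated (multicollinearity) or when p approaches or exceeds n, XᵀX becomes ill-conditioned or singular, so the solution is numerically unstable and the model overfits. Ridge regression adds a penalty λ‖β‖² to the objective, shrinking coefficients toward zero and stabilizing the solution; it retains a closed form, β = (XᵀX + λI)⁻¹Xᵀy, and can also be fit with gradient descent. Lasso adds λ‖β‖₁, which drives some coefficients exactly to zero, performing automatic feature selection. Because the ℓ₁ penalty is not differentiable at zero, lasso has no closed-form solution in general and is typically solved with coordinate descent or proximal gradient methods rather than plain gradient descent. Both are still "linear regression" in the same model class: same structure, different objective function.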
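The contrast between the two penalties can be sketched as follows. Ridge is computed from its closed form, and lasso with a basic proximal gradient loop (ISTA, using soft-thresholding), which is one of several possible solvers and is chosen here only for brevity. The function names, the toy design matrix, and the λ values are assumptions for illustration; the two objectives are scaled differently, so the λ values are not comparable across methods, and no intercept is fitted.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Minimize ||X beta - y||^2 + lam * ||beta||^2 in closed form."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t, clipping at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fit_lasso_ista(X, y, lam, n_steps=500):
    """Minimize (1 / 2n) * ||X beta - y||^2 + lam * ||beta||_1 with ISTA."""
    n, p = X.shape
    # Step size 1/L, where L is the largest eigenvalue of X^T X / n.
    step = n / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(p)
    for _ in range(n_steps):
        grad = X.T @ (X @ beta - y) / n                 # smooth part only
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Toy design with orthogonal columns (invented for this example).
X = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
y = 3.0 * X[:, 0] + 0.1 * X[:, 1]   # second feature matters very little

beta_ridge = fit_ridge(X, y, lam=4.0)
beta_lasso = fit_lasso_ista(X, y, lam=0.5)
print(beta_ridge)  # both coefficients shrunk, neither exactly zero
print(beta_lasso)  # weak coefficient set exactly to zero
```

On this toy problem ridge shrinks both coefficients proportionally, while the soft-thresholding step in the lasso solver removes the weak feature entirely, which is the feature-selection behavior described above.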

Prerequisite Chain

Counting to 10 → Counting to 20 → Understanding Zero → The Number Zero → Counting to Five → One-to-One Correspondence → Combining Small Groups Within 5 → Addition Within 10 → Addition Within 20 → Two-Digit Addition Without Regrouping → Two-Digit Addition with Regrouping → Addition Within 100 → Repeated Addition as Multiplication → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Two-Digit by One-Digit Division → Division with Remainders → Remainders and Quotients in Division → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Reading and Writing Decimals → Comparing and Ordering Decimals → Adding and Subtracting Decimals → Multiplying Decimals → Dividing Decimals → Dividing Fractions → Mixed Number Arithmetic → Order of Operations → Integer Order of Operations → Variable Expressions → Combining Like Terms → One-Step Equations → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Literal Equations → Slope-Intercept Form → Point-Slope Form → Writing Linear Equations → Parallel and Perpendicular Line Slopes → Graphing Linear Equations → Piecewise Functions → Step Functions → Composition of Functions → Inverse Functions → Radical Functions and Graphs → Rational Exponents → Exponential Functions and Graphs → Geometric Sequences and Series → Sigma Notation → Expected Value → Linear Regression in Machine Learning

Longest path: 64 steps · 378 total prerequisite topics
