A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Ridge, Lasso, and Elastic Net Regression

College · Depth 115 in the knowledge graph

584 prerequisites beneath it

Core Idea

Ridge (L2), Lasso (L1), and Elastic Net each add a penalty term to the OLS loss. Ridge shrinks all coefficients toward zero; Lasso can set weak coefficients exactly to zero; Elastic Net combines both penalties. All three stabilize estimates under multicollinearity, and Lasso and Elastic Net also perform variable selection.

How It's Best Learned

Fit models with varying penalty parameters (lambda) and plot coefficient paths. Use cross-validation to choose the optimal lambda that balances fit and parsimony.

Explainer

Standard OLS finds the coefficient vector that minimizes the sum of squared residuals: it fits the sample as closely as possible, with no other constraint. When you have many predictors, especially correlated ones (multicollinearity, a prerequisite for this topic), that becomes a problem: OLS can assign large, opposite-signed coefficients to correlated variables, chasing noise in the sample to marginally improve fit. The estimates then have high variance, so they change sharply from one sample to the next and are unreliable for interpretation, and prediction on new data can suffer as well. Regularization is the remedy: deliberately accept a little bias in exchange for much lower variance.
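
The instability described above can be seen in a small simulation. This is a minimal sketch, not part of the original explainer: the helper names (`ols`, `coef_spread`), the sample size, and the noise levels are all illustrative assumptions. It repeatedly draws samples in which the second predictor is a noisy copy of the first, fits OLS, and records how much the first coefficient varies across samples.

```python
import numpy as np

def ols(X, y):
    # Least-squares fit (no intercept; the simulated data have mean zero).
    return np.linalg.lstsq(X, y, rcond=None)[0]

def coef_spread(noise_scale, n=100, reps=200, seed=0):
    """Std. dev. of the OLS estimate of the first coefficient across samples."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(reps):
        x1 = rng.normal(size=n)
        # Small noise_scale makes x2 almost a copy of x1 (strong collinearity).
        x2 = x1 + noise_scale * rng.normal(size=n)
        X = np.column_stack([x1, x2])
        y = x1 + x2 + rng.normal(size=n)  # true coefficients are (1, 1)
        estimates.append(ols(X, y)[0])
    return float(np.std(estimates))

sd_collinear = coef_spread(noise_scale=0.01)  # x1 and x2 nearly identical
sd_moderate = coef_spread(noise_scale=1.0)    # x1 and x2 only moderately correlated
print(f"spread with near-collinear predictors: {sd_collinear:.2f}")
print(f"spread with moderately correlated predictors: {sd_moderate:.2f}")
```

Under these assumptions the spread in the near-collinear case should be far larger, even though the true coefficients are the same in both settings.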

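Ridge regression adds a penalty term to the OLS loss function: instead of minimizing Σ(yᵢ - ŷᵢ)², it minimizes Σ(yᵢ - ŷᵢ)² + λΣβⱼ² (the L2 penalty). The λ parameter controls how harsh the penalty is. When λ = 0, you get standard OLS. As λ increases, coefficients are pulled ("shrunk") toward zero. Crucially, ridge shrinks every coefficient but, for any finite λ, never sets one exactly to zero, so all p predictors stay in the model. The shrinkage is uniformly proportional only in the special case of orthonormal predictors; in general it is strongest along the directions in which the predictors vary least. Because the penalty depends on the scale of each coefficient, predictors are usually standardized first and the intercept is left unpenalized. Ridge is a good fit when many variables each contribute a small signal and you want to dampen their collective noise.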
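
The ridge objective has a closed-form minimizer, β = (XᵀX + λI)⁻¹Xᵀy, which makes the effect of λ easy to inspect. The sketch below is illustrative only: the simulated data, the `ridge` helper, and the λ grid are assumptions, and the predictors are standardized and the response centered so that no intercept is needed.

```python
import numpy as np

def ridge(X, y, lam):
    """Minimize sum((y - X b)^2) + lam * sum(b^2) via the normal equations."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize the predictors
true_beta = np.array([3.0, -2.0, 0.5, 0.0, 0.0])
y = X @ true_beta + rng.normal(size=n)
y = y - y.mean()                           # center the response

lambdas = [0.0, 1.0, 10.0, 100.0, 1000.0]
path = np.array([ridge(X, y, lam) for lam in lambdas])  # one row per lambda
norms = np.linalg.norm(path, axis=1)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

for lam, b in zip(lambdas, path):
    print(f"lambda={lam:7.1f}  coefficients={np.round(b, 3)}")
```

The row for λ = 0 matches OLS, the length of the coefficient vector falls as λ grows, and no coefficient reaches exactly zero. Plotting each column of `path` against λ gives the coefficient paths mentioned under How It's Best Learned.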

Lasso (Least Absolute Shrinkage and Selection Operator) uses an L1 penalty instead: Σ(yᵢ - ŷᵢ)² + λΣ|βⱼ|. Penalizing absolute values rather than squares has a geometric consequence: the constraint region (a diamond in two dimensions) has corners on the axes, and the optimal solution often sits exactly at a corner, where some βⱼ = 0. Lasso therefore performs automatic variable selection: for a large enough λ it sets the coefficients of weak predictors exactly to zero, producing sparse models. If you believe only a subset of your variables genuinely matter, lasso is the more appropriate tool.
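
The contrast between the two penalties is clearest in the special case of orthonormal predictors (XᵀX = I), where both estimators are simple functions of the OLS coefficients: ridge divides each one by 1 + λ, while lasso applies soft-thresholding at λ/2. The sketch below assumes that special case, and the OLS coefficients are made-up numbers chosen for illustration.

```python
import numpy as np

def soft_threshold(b, t):
    # Move each value toward zero by t; anything within t of zero becomes zero.
    return np.sign(b) * np.maximum(np.abs(b) - t, 0.0)

# Hypothetical OLS coefficients under an orthonormal design (X.T @ X = I).
beta_ols = np.array([3.0, -2.0, 0.4, -0.1])
lam = 1.0

# Minimizer of sum((y - X b)^2) + lam * sum(b^2) when X.T @ X = I.
beta_ridge = beta_ols / (1.0 + lam)
# Minimizer of sum((y - X b)^2) + lam * sum(|b|) when X.T @ X = I.
beta_lasso = soft_threshold(beta_ols, lam / 2.0)

print("ridge:", beta_ridge)
print("lasso:", beta_lasso)
```

Here ridge halves every coefficient and keeps all four, while lasso subtracts 0.5 from the magnitude of each and sets the two weakest to zero, leaving (2.5, -1.5, 0, 0).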

Elastic Net blends both penalties, adding λ₁Σ|βⱼ| + λ₂Σβⱼ² to the sum of squared residuals. It inherits lasso's sparsity property while retaining ridge's ability to handle groups of correlated predictors (lasso tends to keep one predictor from a correlated group, somewhat arbitrarily, and drop the rest; elastic net tends to keep the whole group with dampened coefficients). In practice, the choice among the three depends on the problem: many small signals favor ridge, a sparse signal favors lasso, and correlated predictors with an unknown structure favor elastic net.
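
The grouping behaviour can be illustrated with an extreme case: two identical predictors. The sketch below uses a basic coordinate-descent solver written for this example (the function name `elastic_net_cd`, the penalty values, and the simulated data are assumptions, and it is not how any particular library implements the method). Setting `lam2=0` reduces it to lasso.

```python
import numpy as np

def soft_threshold(rho, t):
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def elastic_net_cd(X, y, lam1, lam2, sweeps=200):
    """Coordinate descent for sum((y - X b)^2) + lam1*sum(|b|) + lam2*sum(b^2)."""
    p = X.shape[1]
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(sweeps):
        for j in range(p):
            # Partial residual: the part of y not explained by the other predictors.
            r_j = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ r_j
            # Exact one-dimensional minimizer for coordinate j.
            b[j] = soft_threshold(rho, lam1 / 2.0) / (col_sq[j] + lam2)
    return b

rng = np.random.default_rng(1)
x = rng.normal(size=100)
X = np.column_stack([x, x])            # two identical predictors
y = 2.0 * x + rng.normal(size=100)

b_lasso = elastic_net_cd(X, y, lam1=20.0, lam2=0.0)   # L1 penalty only
b_enet = elastic_net_cd(X, y, lam1=20.0, lam2=50.0)   # L1 and L2 penalties

print("lasso:      ", np.round(b_lasso, 3))
print("elastic net:", np.round(b_enet, 3))
```

With identical columns the lasso solution is not unique, and this solver happens to give all the weight to whichever column it updates first. Adding the L2 term makes the solution unique and splits the weight equally between the two copies.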

The key insight unifying all three is the bias-variance tradeoff. Increasing λ introduces bias (the estimates are systematically pulled away from the true values) but reduces variance (the model responds less to sample-specific noise). The optimal λ is typically found through k-fold cross-validation: fit the model at many λ values, evaluate out-of-sample prediction error at each, and choose the λ that minimizes that error. This is where the discipline of regularization lives: not in the penalty algebra, but in the principled use of held-out data to tune the tradeoff.
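
The cross-validation procedure just described can be sketched in a few lines for ridge. Everything specific here is an assumption made for illustration: the simulated data, the 5 folds, the λ grid, and the helper names `ridge` and `cv_error`. For brevity the predictors are not re-standardized within each fold, which a careful analysis would do.

```python
import numpy as np

def ridge(X, y, lam):
    """Minimize sum((y - X b)^2) + lam * sum(b^2) via the normal equations."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_error(X, y, lam, k=5, seed=0):
    """Mean squared prediction error of ridge at `lam`, averaged over k folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    errors = []
    for held_out in np.array_split(idx, k):
        train = np.setdiff1d(idx, held_out)
        beta = ridge(X[train], y[train], lam)          # fit on k-1 folds
        resid = y[held_out] - X[held_out] @ beta       # predict the held-out fold
        errors.append(np.mean(resid ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(42)
n, p = 60, 40                          # many predictors relative to observations
X = rng.normal(size=(n, p))
true_beta = np.concatenate([np.full(5, 2.0), np.zeros(p - 5)])
y = X @ true_beta + rng.normal(size=n)

lambdas = np.logspace(-2, 4, 25)
# The fixed seed means every lambda is scored on the same fold split.
cv_errors = np.array([cv_error(X, y, lam) for lam in lambdas])
best_lam = float(lambdas[np.argmin(cv_errors)])
print(f"lambda with lowest cross-validated error: {best_lam:.3g}")
```

Plotting `cv_errors` against `lambdas` on a log scale usually shows the tradeoff directly: error is high when λ is so large that the real signal is shrunk away, and it can also rise at very small λ when the model overfits.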


Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Dividing Integers → Unit Rates → Proportions → Percent Concept → Converting Between Fractions, Decimals, and Percents → Operations with Rational Numbers → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → 
Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Sine, Cosine, and Tangent Ratios → Trigonometric Ratios Review → Radian Measure → Converting Between Degrees and Radians → The Unit Circle → Graphing Sine and Cosine → Graphing Tangent and Reciprocal Trigonometric Functions → Derivatives of Trigonometric Functions → Antiderivatives → Indefinite Integrals → Basic Integration Rules → Riemann Sums → Definite Integral Definition → Probability Density Functions and Continuous Distributions → Cumulative Distribution Functions → Continuous Random Variables → Probability Density Functions → Expected Value → Weak Law of Large Numbers → Probability Axioms and Rules → Conditional Probability → Independence of Events → Sampling Distributions → Standard Error of Estimators → Hypothesis Testing: Framework and Logic → P-values and Statistical Significance → Effect Size and Practical Significance → Hypothesis Testing: Framework and Logic → Z-Tests and T-Tests for Means → One-Sample Z-Test for Means → One-Sample and Two-Sample T-Tests → Inference in Linear Regression → Prediction Intervals in Regression → Linear Regression Basics → Residuals and Goodness of Fit (R²) → Simple (Bivariate) OLS Regression → Classical OLS Assumptions (Gauss-Markov) → Multiple Regression → Interpreting Regression Coefficients → Hypothesis Testing in Regression → F-Test and Joint Significance → R-Squared and Model Fit → Multicollinearity → Ridge, Lasso, and Elastic Net Regression

Longest path: 116 steps · 584 total prerequisite topics

Prerequisites (2)

Multiple Regression (hard) · Multicollinearity (hard)

Leads To (0)

No topics depend on this one yet.