Ridge, Lasso, and Elastic Net Regression

College · Depth 83 in the knowledge graph
regularization ridge lasso elastic-net

Core Idea

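Ridge (L2), Lasso (L1), and Elastic Net add penalty terms to the OLS loss. Ridge shrinks all coefficients toward zero; Lasso can set the coefficients of weak variables exactly to zero; Elastic Net combines both penalties. All three stabilize estimates under multicollinearity, and Lasso and Elastic Net also perform variable selection (Ridge does not).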

How It's Best Learned

Fit models with varying penalty parameters (lambda) and plot coefficient paths. Use cross-validation to choose the optimal lambda that balances fit and parsimony.
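
As a rough sketch of this exercise, the snippet below computes a ridge coefficient path on synthetic data and prints it as a table. The data-generating setup, the `ridge_fit` helper, and the λ grid are all assumptions made for illustration; the closed-form ridge solution is used only because it keeps the example short.

```python
import numpy as np

# Synthetic data (an assumption for illustration): three predictors,
# the first two strongly correlated, true coefficients (2, 0, -1).
rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 - 1.0 * x3 + rng.normal(size=n)

# Standardize X and center y so the penalty treats predictors alike
# and no intercept is needed.
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = y - y.mean()

def ridge_fit(X, y, lam):
    """Minimize sum((y - X b)^2) + lam * sum(b^2) in closed form."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

lambdas = np.logspace(-2, 4, 7)  # 0.01, 0.1, ..., 10000
path = np.array([ridge_fit(X, y, lam) for lam in lambdas])

# One row per lambda: each column traces one coefficient's path.
for lam, coefs in zip(lambdas, path):
    print(f"lambda={lam:>8.2f}  coefs={np.round(coefs, 3)}")
```

To turn the table into a coefficient-path plot, draw each column of `path` against `lambdas` on a logarithmic x-axis with whatever plotting library you prefer.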

Explainer

Standard OLS finds the coefficient vector that minimizes the sum of squared residuals: it fits the sample as closely as possible, with no other constraint. When you have many predictors, especially correlated ones (multicollinearity, your prerequisite), this becomes a problem. OLS can assign large, opposite-signed coefficients to correlated variables, chasing noise in the sample to marginally improve fit. The estimates then have high variance: they swing widely from one sample to the next, which makes them hard to interpret and can hurt prediction on new data. Regularization addresses this by deliberately accepting a little more bias in exchange for much lower variance.
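
The instability can be seen in a small simulation. The sketch below is an illustrative setup, not a standard benchmark: it repeatedly draws two predictors with a chosen correlation `rho`, fits OLS, and records how much the first coefficient varies across samples. The function name `ols_coef_sd` and all constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def ols_coef_sd(rho, n=50, reps=300):
    """Std. dev. of the first OLS coefficient over repeated samples.

    x1 and x2 have unit variance and correlation rho.
    True model (assumed): y = x1 + x2 + standard normal error.
    """
    first_coefs = []
    for _ in range(reps):
        x1 = rng.normal(size=n)
        x2 = rho * x1 + np.sqrt(1.0 - rho**2) * rng.normal(size=n)
        X = np.column_stack([x1, x2])
        y = x1 + x2 + rng.normal(size=n)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        first_coefs.append(beta[0])
    return float(np.std(first_coefs))

sd_low = ols_coef_sd(rho=0.0)
sd_high = ols_coef_sd(rho=0.999)
print(f"sd of beta_1 with rho=0.0:   {sd_low:.3f}")
print(f"sd of beta_1 with rho=0.999: {sd_high:.3f}")
```

With nearly collinear predictors, the spread of the estimated coefficient should come out many times larger than in the uncorrelated case, even though the true coefficient is 1 in both.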

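Ridge regression adds a penalty term to the OLS loss function: instead of minimizing Σ(yᵢ - ŷᵢ)², it minimizes Σ(yᵢ - ŷᵢ)² + λΣβⱼ² (the L2 penalty). The intercept is normally left unpenalized, and predictors are usually standardized first so that the penalty does not depend on their units. The parameter λ ≥ 0 controls how harsh the penalty is. When λ = 0, you get standard OLS. As λ increases, coefficients are pulled ("shrunk") toward zero. Ridge shrinks every coefficient but, for finite λ, generally sets none of them exactly to zero, so you retain all p predictors in the model. The shrinkage is uniformly proportional only when the predictors are orthonormal; in general it is strongest along the directions in which the predictors vary least, which is where correlated predictors make OLS unstable. This makes ridge a good fit when many variables each contribute a small signal and you want to dampen their collective noise.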
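
For readers who want the algebra, here is a short sketch of the ridge solution and of how much each direction is shrunk. It assumes y is centered and X is standardized so the intercept can be dropped; the SVD notation (U, D, V, with singular values d_j and left singular vectors u_j) is introduced here and is not part of the text above.

```latex
% Ridge estimate in closed form
\hat{\beta}^{\text{ridge}}
  = \arg\min_{\beta}\; \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2
  = (X^\top X + \lambda I)^{-1} X^\top y .

% With the thin SVD X = U D V^\top and singular values d_1 \ge \dots \ge d_p,
% the fitted values are
X\hat{\beta}^{\text{ridge}}
  = \sum_{j=1}^{p} u_j \,\frac{d_j^2}{d_j^2 + \lambda}\, u_j^\top y .
```

Each factor d_j² / (d_j² + λ) equals 1 when λ = 0 (plain OLS) and is smallest for the smallest d_j, so low-variance directions are shrunk the most. When the columns of X are orthonormal, every d_j = 1 and each coefficient is simply scaled by 1 / (1 + λ).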

Lasso (Least Absolute Shrinkage and Selection Operator) uses an L1 penalty instead: Σ(yᵢ - ŷᵢ)² + λΣ|βⱼ|. Penalizing absolute values rather than squares has a geometric consequence: the constraint region (a diamond in two dimensions) has corners on the axes, and the optimal solution often sits exactly at a corner or edge where some βⱼ = 0. Lasso therefore performs automatic variable selection: for large enough λ it sets the coefficients of weak predictors exactly to zero, producing sparse models. If you believe only a subset of your variables genuinely matter, lasso is the more appropriate tool.
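
The contrast between the two penalties is easiest to see in the special case of orthonormal predictors (XᵀX = I), where both problems separate into one-dimensional problems with simple solutions. The sketch below assumes that special case; the OLS coefficients are made-up numbers, and `soft_threshold` is a helper defined here for illustration.

```python
def soft_threshold(z, t):
    """Move z toward zero by t; return exactly 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

# Hypothetical OLS coefficients for three orthonormal predictors.
ols = [3.0, -1.5, 0.4]
lam = 1.0

# Ridge: minimizing (b - b_ols)^2 + lam * b^2 gives b_ols / (1 + lam).
ridge = [b / (1.0 + lam) for b in ols]

# Lasso: minimizing (b - b_ols)^2 + lam * |b| gives soft-thresholding
# of b_ols at lam / 2.
lasso = [soft_threshold(b, lam / 2.0) for b in ols]

print("ridge:", ridge)  # ridge: [1.5, -0.75, 0.2]
print("lasso:", lasso)  # lasso: [2.5, -1.0, 0.0]
```

Ridge rescales every coefficient by the same factor and leaves all of them nonzero, while lasso subtracts a fixed amount from each and sets the weakest one exactly to zero.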

Elastic Net blends both penalties: it minimizes Σ(yᵢ - ŷᵢ)² + λ₁Σ|βⱼ| + λ₂Σβⱼ². Setting λ₂ = 0 recovers lasso, and setting λ₁ = 0 recovers ridge. Elastic Net inherits lasso's sparsity property while retaining ridge's ability to handle groups of correlated predictors: lasso tends to keep one predictor from a correlated group, more or less arbitrarily, and drop the rest, whereas elastic net can retain the whole group with dampened coefficients. In practice, the choice among the three depends on the problem: many small signals favor ridge, a sparse signal favors lasso, and correlated predictors with an unknown structure favor elastic net.
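
The grouping behavior can be illustrated with the most extreme correlated group: two identical predictors. The snippet below is a minimal coordinate-descent sketch written for this example, not how any particular library implements elastic net; `elastic_net_cd`, the toy data, and the penalty values are all assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t, returning 0 when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def elastic_net_cd(X, y, lam1, lam2, sweeps=500):
    """Coordinate descent for sum((y - X b)^2) + lam1*sum|b| + lam2*sum(b^2).

    Assumes y and the columns of X are centered, so there is no intercept.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)  # x_j' x_j for each column
    for _ in range(sweeps):
        for j in range(p):
            # Partial residual: the fit of every predictor except j removed.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            beta[j] = soft_threshold(rho, lam1 / 2.0) / (col_ss[j] + lam2)
    return beta

# Two identical, centered predictors and a noiseless response.
x = np.linspace(-1.0, 1.0, 21)
X = np.column_stack([x, x])
y = 3.0 * x

lasso_beta = elastic_net_cd(X, y, lam1=2.0, lam2=0.0)  # L1 penalty only
enet_beta = elastic_net_cd(X, y, lam1=2.0, lam2=1.0)   # L1 + L2 penalties
print("lasso      :", np.round(lasso_beta, 3))
print("elastic net:", np.round(enet_beta, 3))
```

With the L1 penalty alone, this solver puts essentially all the weight on whichever duplicate it updates first, which reflects the arbitrariness described above (any split between the two is an equally good lasso solution). Adding the L2 term makes the solution unique and splits the weight equally between the duplicates.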

The key insight unifying all three is the bias-variance tradeoff. Increasing λ introduces bias (coefficients drift from their true values) but reduces variance (the model responds less to sample-specific noise). The optimal λ is typically found through k-fold cross-validation: fit the model at many λ values, evaluate out-of-sample prediction error at each, and choose the λ that minimizes that error. This is where the discipline of regularization lives — not in the penalty algebra, but in the principled use of held-out data to tune the tradeoff.
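
The cross-validation loop described above can be sketched in a few lines. This version tunes ridge only, uses the closed-form solution, and runs on synthetic data; the helper names `ridge_fit` and `cv_error`, the λ grid, and the data-generating process are illustrative assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize sum((y - X b)^2) + lam * sum(b^2) in closed form."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_error(X, y, lam, k=5, seed=0):
    """Mean squared prediction error of ridge over k held-out folds."""
    n = len(y)
    # A fixed seed gives every lambda the same fold split.
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    errors = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        # Center with training-fold statistics only, to avoid leakage.
        x_mean, y_mean = X[train].mean(axis=0), y[train].mean()
        beta = ridge_fit(X[train] - x_mean, y[train] - y_mean, lam)
        pred = (X[fold] - x_mean) @ beta + y_mean
        errors.append(np.mean((y[fold] - pred) ** 2))
    return float(np.mean(errors))

# Synthetic data (an assumption): 20 predictors, each with a weak signal.
rng = np.random.default_rng(42)
n, p = 60, 20
X = rng.normal(size=(n, p))
y = X @ rng.normal(scale=0.3, size=p) + rng.normal(size=n)

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0, 1000.0]
cv_errors = [cv_error(X, y, lam) for lam in lambdas]
best_lambda = lambdas[int(np.argmin(cv_errors))]
print("CV errors:", np.round(cv_errors, 3))
print("chosen lambda:", best_lambda)
```

Reusing the same fold split for every λ keeps the comparison fair, since differences in error then come from the penalty rather than from which observations happened to be held out.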
