Questions: Ridge, Lasso, and Elastic Net Regression
5 questions to test your understanding
Question 1 Multiple Choice
You have a dataset with 200 candidate predictors and believe only about 20 are genuinely related to the outcome. Which regularization method is most appropriate?
A. Ridge regression, because it handles large numbers of predictors by shrinking all coefficients
B. OLS, because you need unbiased estimates to identify the true 20 predictors
C. Lasso regression, because it performs automatic variable selection by driving some coefficients to exactly zero
D. Elastic Net, because you always need both L1 and L2 penalties when predictors outnumber observations
Answer: C
When the true signal is sparse (only a small fraction of predictors matter), Lasso is the natural choice. Its L1 penalty produces sparse solutions by driving weak coefficients to exactly zero, which amounts to automatic variable selection. Ridge keeps every predictor and only shrinks the coefficients, so it suits the case where many variables each contribute a small signal. OLS with 200 predictors is likely to overfit unless the sample is very large, and it has no unique solution when predictors outnumber observations. Elastic Net is most useful when correlated predictors should be retained or excluded in groups; it is not a requirement whenever predictors outnumber observations.
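The contrast between Lasso and Ridge on a sparse signal can be illustrated with a small simulation. This is a minimal sketch rather than a reference implementation. It assumes a scaled-down toy problem (10 predictors with 2 truly relevant, instead of 200 and 20), no intercept, a hand-written coordinate-descent solver, and an arbitrary penalty of 0.5. All names and data are hypothetical.

```python
import random

def soft_threshold(z, t):
    """Shrink z toward zero by t; return exactly 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def fit_penalized(X, y, lam, penalty, sweeps=100):
    """Coordinate descent for (1/(2n)) * RSS + penalty term.

    penalty="l1": lam * sum(|b_j|)       (Lasso)
    penalty="l2": (lam/2) * sum(b_j^2)   (Ridge)
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # y - X @ beta; equals y while beta is all zeros
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            # Covariance of column j with the residual, with beta_j added back in.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_ss[j] * beta[j]
            if penalty == "l1":
                new = soft_threshold(rho, lam) / col_ss[j]
            else:
                new = rho / (col_ss[j] + lam)
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta

random.seed(0)
n, p = 200, 10
true_beta = [3.0, -2.0] + [0.0] * (p - 2)  # only the first two predictors matter
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [sum(b * x for b, x in zip(true_beta, row)) + random.gauss(0, 1) for row in X]

lasso = fit_penalized(X, y, lam=0.5, penalty="l1")
ridge = fit_penalized(X, y, lam=0.5, penalty="l2")
print("lasso:", [round(b, 3) for b in lasso])
print("ridge:", [round(b, 3) for b in ridge])
print("exact zeros -> lasso:", sum(b == 0.0 for b in lasso),
      "ridge:", sum(b == 0.0 for b in ridge))
```

In this setup the Lasso fit should zero out the irrelevant predictors while Ridge leaves every coefficient small but nonzero. The exact values depend on the seed and the penalty chosen.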
Question 2 Multiple Choice
Why does Lasso drive some coefficients to exactly zero while Ridge only shrinks them toward (but never to) zero?
A. Lasso uses a larger default penalty parameter λ, forcing more shrinkage
B. The L1 constraint region has corners at the coordinate axes; the optimization solution often lands exactly on a corner where a coefficient is zero
C. Lasso uses an iterative algorithm that terminates early, leaving some coefficients unupdated
D. Ridge uses squared penalties which are stronger than absolute-value penalties and push coefficients further from zero
Answer: B
The geometric intuition is key. The L2 (Ridge) constraint region is a ball (a disc in 2D) with a smooth boundary and no corners, so the contours of the least-squares loss almost always first touch it at a point where every coordinate is nonzero. The L1 (Lasso) constraint region is a diamond in 2D, with corners exactly on the coordinate axes. The loss contours often first touch this region at a corner, where one or more coordinates are exactly zero. This geometric property, not just the strength of the penalty, produces sparsity.
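The corner argument can be checked numerically in two dimensions. The sketch below assumes circular loss contours (an orthonormal design), in which case the constrained solution is simply the closest point of the constraint region to the unconstrained estimate. The constraint radius, the sampling range, and the function names are hypothetical choices for illustration.

```python
import math
import random

def project_l2(a, b, t):
    """Closest point to (a, b) inside the disc a^2 + b^2 <= t^2."""
    norm = math.hypot(a, b)
    if norm <= t:
        return a, b
    return a * t / norm, b * t / norm

def project_l1(a, b, t):
    """Closest point to (a, b) inside the diamond |a| + |b| <= t."""
    u, v = abs(a), abs(b)
    if u + v <= t:
        return a, b
    if abs(u - v) >= t:
        # Nearest point is a corner: the smaller coordinate becomes exactly zero.
        if u > v:
            return math.copysign(t, a), 0.0
        return 0.0, math.copysign(t, b)
    shift = (u + v - t) / 2.0  # same shrinkage applied to both coordinates
    return math.copysign(u - shift, a), math.copysign(v - shift, b)

random.seed(1)
t = 1.0
points = [(random.uniform(-3, 3), random.uniform(-3, 3)) for _ in range(1000)]
# Keep only estimates outside the disc, so both constraints are binding.
outside = [(a, b) for a, b in points if math.hypot(a, b) > t]

l1_zeros = sum(0.0 in project_l1(a, b, t) for a, b in outside)
l2_zeros = sum(0.0 in project_l2(a, b, t) for a, b in outside)
print(f"L1: {l1_zeros} of {len(outside)} solutions contain an exact zero")
print(f"L2: {l2_zeros} of {len(outside)} solutions contain an exact zero")
```

A sizable share of the L1 solutions land on a corner, while the L2 solutions keep both coordinates nonzero unless the unconstrained estimate already sits on an axis.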
Question 3 True / False
Increasing the regularization parameter λ in Ridge regression always increases the model's bias while decreasing its variance.
T. True
F. False
Answer: True
This is the bias-variance tradeoff at the heart of regularization. A higher λ pulls the coefficients further from the OLS estimates (which minimize in-sample error), introducing bias: the model no longer chases the training data's idiosyncratic patterns. At the same time, the fit becomes less sensitive to the specific sample, which reduces variance. At λ = 0, Ridge equals OLS (unbiased, high variance); as λ → ∞, all coefficients → 0 (maximum bias, near-zero variance). The optimal λ balances these forces.
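For a single predictor the tradeoff can be written down exactly, which makes the monotone behaviour easy to inspect. The sketch below assumes a no-intercept model y = βx + noise with fixed x values, and the ridge estimator b = Σxy / (Σx² + λ). The design, true slope, and noise level are hypothetical.

```python
def ridge_bias_variance(x, beta, sigma, lam):
    """Squared bias and variance of b = sum(x*y) / (sum(x^2) + lam).

    Assumes y = beta * x + noise, with fixed x, noise mean 0 and std sigma.
    """
    s = sum(v * v for v in x)
    bias_sq = (beta * lam / (s + lam)) ** 2     # E[b] - beta = -beta*lam/(s+lam)
    variance = sigma ** 2 * s / (s + lam) ** 2  # Var[b] = sigma^2 * s/(s+lam)^2
    return bias_sq, variance

x = [0.5, -1.0, 1.5, -0.5, 1.0]  # hypothetical fixed design
beta, sigma = 2.0, 1.0           # hypothetical true slope and noise level
lambdas = [0.0, 0.25, 1.0, 5.0, 25.0]
results = [ridge_bias_variance(x, beta, sigma, lam) for lam in lambdas]
for lam, (bias_sq, variance) in zip(lambdas, results):
    print(f"lambda={lam:6.2f}  bias^2={bias_sq:.4f}  variance={variance:.4f}  "
          f"mse={bias_sq + variance:.4f}")
```

Squared bias rises and variance falls with every increase in λ. Under these assumptions their sum is smallest at λ = σ²/β² (0.25 here), which is the balance point the explanation refers to.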
Question 4 True / False
Ridge regression is the preferred regularization method when you believe that only a sparse subset of predictors is truly relevant to the outcome.
T. True
F. False
Answer: False
This describes the ideal scenario for Lasso, not Ridge. Ridge shrinks all coefficients but keeps every predictor in the model — it never produces a sparse solution. When the true signal is sparse, Ridge assigns small but nonzero coefficients to all irrelevant predictors, adding noise and complicating interpretation. Lasso's automatic variable selection directly suits this scenario. Ridge is preferable when many predictors each contribute small signals and you want to dampen collective noise without eliminating any.
Question 5 Short Answer
Explain the bias-variance tradeoff in regularization and describe how cross-validation is used to choose the optimal penalty parameter λ.
Model answer: Regularization introduces bias by penalizing large coefficients, forcing them toward zero and away from the OLS estimates that minimize in-sample error. In exchange, variance falls: the model is less sensitive to noise in the specific training sample and, at a well-chosen λ, generalizes better to new data. The optimal λ balances these two forces. Cross-validation finds this optimum empirically: the data is split into k folds, and for each candidate λ the model is fit on k−1 folds and its prediction error is measured on the held-out fold, with each fold held out in turn. The λ that minimizes the average out-of-sample error is chosen.
In-sample fit never gets worse as λ decreases (more flexibility), but out-of-sample error typically follows a U-shape: too little regularization overfits, too much underfits. Cross-validation estimates the λ at the bottom of that U, making regularization a principled, data-driven procedure rather than an ad hoc tuning choice.
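The cross-validation procedure described in the model answer can be sketched in a few lines. This is a deliberately small illustration, assuming a one-predictor, no-intercept ridge model, interleaved folds, and a hand-picked λ grid. The data, grid, and function names are hypothetical, and a real analysis would normally use a library routine.

```python
import random

def ridge_fit(xs, ys, lam):
    """One-predictor ridge slope: minimizes sum((y - b*x)^2) + lam * b^2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cv_error(xs, ys, lam, k):
    """Mean squared prediction error at lam, averaged over k held-out folds."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        held_out = set(range(fold, n, k))  # every k-th point forms one fold
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        b = ridge_fit(train_x, train_y, lam)  # fit on the other k-1 folds
        sq = [(ys[i] - b * xs[i]) ** 2 for i in held_out]
        fold_errors.append(sum(sq) / len(sq))
    return sum(fold_errors) / k

random.seed(42)
true_beta, n = 0.3, 20  # weak signal, small sample: shrinkage may help
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [true_beta * x + random.gauss(0, 1) for x in xs]

grid = [0.0, 0.1, 1.0, 3.0, 10.0, 30.0, 100.0]
errors = [cv_error(xs, ys, lam, k=5) for lam in grid]
best_lam = grid[errors.index(min(errors))]
for lam, err in zip(grid, errors):
    print(f"lambda={lam:6.1f}  cv_mse={err:.4f}")
print("chosen lambda:", best_lam)
```

The chosen λ is simply the grid value with the lowest average held-out error, so the result is only as fine-grained as the grid and will vary with the sample and the fold assignment.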