Question 1 Multiple Choice
A model achieves near-perfect training accuracy but performs poorly on unseen test data. You apply L1 regularization. Which best explains how L1 addresses the problem?
A. It increases model capacity so the model fits both training and test distributions better
B. It penalizes the absolute value of weights, driving some to exactly zero and reducing effective model complexity
C. It averages predictions across many sub-models trained on different random subsets
D. It adds noise to training labels to prevent the model from memorizing specific examples
Answer: B
The problem is overfitting: the model has memorized training noise rather than learned generalizable patterns. L1 regularization adds λ·Σ|wᵢ| to the loss, penalizing weight magnitudes. Because the equivalent L1 constraint region is diamond-shaped, with its corners on the coordinate axes, the optimum often lands at a corner where some weights are exactly zero, which effectively performs feature selection and reduces complexity. Option A points the wrong way, since adding capacity tends to make overfitting worse. Option C describes bagging (an ensemble method), and option D describes label-noise injection (related to label smoothing), which is a different regularization technique.
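A minimal sketch of the penalty term and its sparsifying effect. It assumes a simplified per-weight quadratic loss 0.5·(w − z)², where z is a hypothetical unregularized optimum for that weight; under that assumption the L1-penalized minimizer is the soft-thresholding rule. The function names and the numbers are illustrative and not part of the quiz.

```python
def l1_penalty(weights, lam):
    """Return lam * sum(|w_i|), the term L1 regularization adds to the loss."""
    return lam * sum(abs(w) for w in weights)


def soft_threshold(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + lam * |w| over w (assumed toy loss)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # any |z| <= lam is pulled all the way to zero


# Hypothetical unregularized optimum for each weight.
unregularized = [3.0, -0.2, 0.05, -1.5]
lam = 0.5

regularized = [soft_threshold(z, lam) for z in unregularized]
print(regularized)                    # [2.5, 0.0, 0.0, -1.0]
print(l1_penalty(regularized, lam))   # 1.75
```

The two small weights are removed outright, while the two large ones are only shifted toward zero by λ.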
Question 2 Multiple Choice
You are training a linear model on 1,000 features but suspect only 20 are truly informative. Which regularizer is most appropriate, and why?
A. L2, because it shrinks all weights equally and makes the model more numerically stable
B. L1, because it can drive irrelevant feature weights to exactly zero, performing automatic feature selection
C. Dropout, because it randomly deactivates neurons during training, implicitly ignoring irrelevant features
D. Early stopping, because halting before convergence prevents the model from learning irrelevant features
Answer: B
When you have many features and suspect most are irrelevant, L1 is the right tool. The geometry of the L1 penalty (a hyperdiamond whose corners lie on the coordinate axes) means the optimal solution is often sparse: weights for irrelevant features go to exactly zero. L2 shrinks all weights toward zero in proportion to their size but rarely eliminates any entirely, so all 1,000 features would still contribute weakly. Dropout and early stopping are valid regularizers but do not perform explicit feature selection.
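The contrast between the two penalties can be sketched on synthetic data. This is a scaled-down, hypothetical setup (30 features with 3 informative, rather than 1,000 with 20) so that it runs quickly in pure Python; the coordinate-descent solver, the data, and the value of lam are all assumptions chosen for illustration, not a reference implementation.

```python
import random

random.seed(0)

n_samples, n_features = 60, 30
# Only the first three features carry signal (assumed ground truth).
true_w = [4.0, -3.0, 2.0] + [0.0] * (n_features - 3)

X = [[random.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
y = [sum(x * w for x, w in zip(row, true_w)) + random.gauss(0.0, 0.5) for row in X]


def soft_threshold(z, lam):
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def coordinate_descent(X, y, lam, penalty, sweeps=100):
    """Minimize (1/2n)*||y - Xw||^2 plus lam*||w||_1 ("l1") or (lam/2)*||w||^2 ("l2")."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw, and w starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            if penalty == "l1":
                new_w = soft_threshold(rho, lam) / col_sq[j]
            else:
                new_w = rho / (col_sq[j] + lam)
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w


lasso_w = coordinate_descent(X, y, lam=0.3, penalty="l1")
ridge_w = coordinate_descent(X, y, lam=0.3, penalty="l2")

lasso_zeros = sum(1 for w in lasso_w if w == 0.0)
ridge_zeros = sum(1 for w in ridge_w if w == 0.0)
print("exact zeros with L1:", lasso_zeros, "of", n_features)
print("exact zeros with L2:", ridge_zeros, "of", n_features)
```

In a run like this, the L1 fit is expected to zero out most of the uninformative weights while keeping the three informative ones, and the L2 fit is expected to leave every weight small but nonzero.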
Question 3 True / False
L2 regularization shrinks weights toward zero but rarely sets them to exactly zero, while L1 regularization can produce exactly zero weights.
T. True
F. False
Answer: True
This difference follows from how each penalty's gradient behaves. L2 adds a smooth quadratic penalty, so the gradient of the penalty is proportional to the weight: as a weight approaches zero, the gradient also approaches zero, giving no 'push' all the way to zero. L1 adds an absolute value penalty whose gradient has constant magnitude (±λ) away from zero, so it applies equal pressure regardless of weight magnitude and can push weights exactly to zero. This is why L1 produces sparse models and is used for feature selection.
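A small sketch of these two behaviors on a single weight, looking at the penalty alone (no data term, which is a simplifying assumption). The L2 penalty is taken as 0.5·λ·w², so its gradient is λ·w. For L1, the update shown is a proximal step that stops at zero; plain subgradient descent would instead hover around zero. The function names, learning rate, and λ are illustrative.

```python
def l2_step(w, lr, lam):
    # Gradient of 0.5 * lam * w**2 is lam * w, so the step shrinks with w.
    return w - lr * lam * w


def l1_step(w, lr, lam):
    # Proximal step for lam * |w|: move a fixed amount toward zero, stop at zero.
    step = lr * lam
    if w > step:
        return w - step
    if w < -step:
        return w + step
    return 0.0


w_l2 = w_l1 = 1.0
lr, lam = 0.1, 1.0
for _ in range(50):
    w_l2 = l2_step(w_l2, lr, lam)
    w_l1 = l1_step(w_l1, lr, lam)

print(w_l2)  # small but positive, roughly 0.9 ** 50
print(w_l1)  # 0.0
```

The L2 weight decays geometrically and stays positive, while the L1 weight loses a fixed 0.1 per step and reaches exactly zero after about ten steps.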
Question 4 True / False
Regularization improves a model's training accuracy by penalizing overly complex solutions.
T. True
F. False
Answer: False
Regularization typically lowers training accuracy slightly rather than raising it. By adding a penalty term that discourages large weights or complexity, it prevents the model from fitting the training data as tightly as it otherwise could, which is the point. The goal is to accept a small increase in training loss in exchange for a larger decrease in test loss (generalization error). A regularized model is intentionally biased toward simpler solutions, trading training performance for generalization.
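The effect on training error can be checked with a one-weight ridge fit. This sketch assumes a no-intercept model y ≈ w·x with objective Σ(y − w·x)² + λ·w², whose minimizer is w = Σxy / (Σx² + λ); the five data points are made up for illustration.

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]  # hypothetical noisy observations of y = x


def ridge_slope(xs, ys, lam):
    """Closed-form minimizer of sum((y - w*x)**2) + lam * w**2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def training_mse(xs, ys, w):
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


lambdas = [0.0, 1.0, 10.0, 100.0]
train_mses = [training_mse(xs, ys, ridge_slope(xs, ys, lam)) for lam in lambdas]

for lam, mse in zip(lambdas, train_mses):
    print(f"lambda={lam:6.1f}  training MSE={mse:.4f}")
```

Training error is lowest at λ = 0 and rises as λ grows, because the unpenalized fit is by definition the one that minimizes training error. Whether test error improves depends on the data and is not shown here.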
Question 5 Short Answer
Why does regularization improve generalization even though it makes the model fit the training data less well?
Model answer: Because the training data contains both the true underlying pattern and noise. An unregularized model with high capacity will fit both, memorizing the noise as if it were signal; this is overfitting. Regularization penalizes complexity, forcing the model toward simpler hypotheses that explain the training data without fitting every fluctuation. Simpler models that ignore noise generalize better because the particular noise in the training set does not recur in the test data; only the true pattern carries over.
This is the bias-variance tradeoff in action. Regularization introduces a small amount of bias (the model is nudged away from the exact training-data optimum) but can substantially reduce variance (sensitivity to the specific training samples). If the true pattern is simpler than the model's full capacity, this tradeoff is favorable. The regularization hyperparameter λ controls the balance: too little leaves the model overfitting; too much over-constrains it and causes underfitting.
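The tradeoff can be made concrete for a one-weight ridge estimator. The sketch assumes a no-intercept model y = w·x + noise with fixed inputs and zero-mean, uncorrelated noise of variance σ². Under those assumptions the ridge estimate equals the least-squares estimate times c = S / (S + λ), where S = Σx², giving squared bias (1 − c)²·w² and variance c²·σ²/S. The specific values of w, σ², and S are illustrative.

```python
def ridge_error_terms(true_w, noise_var, sum_x_sq, lam):
    """Squared bias, variance, and their sum for the one-weight ridge estimate."""
    shrink = sum_x_sq / (sum_x_sq + lam)   # c = S / (S + lam)
    bias_sq = ((shrink - 1.0) * true_w) ** 2
    variance = shrink ** 2 * noise_var / sum_x_sq
    return bias_sq, variance, bias_sq + variance


true_w, noise_var, sum_x_sq = 1.0, 4.0, 10.0  # assumed toy values

totals = {}
for lam in (0.0, 4.0, 100.0):
    bias_sq, variance, total = ridge_error_terms(true_w, noise_var, sum_x_sq, lam)
    totals[lam] = total
    print(f"lambda={lam:6.1f}  bias^2={bias_sq:.4f}  variance={variance:.4f}  total={total:.4f}")
```

With these numbers, λ = 0 has zero bias but the highest variance, a moderate λ gives the lowest total error, and a very large λ is dominated by bias, which matches the overfit, balanced, and underfit regimes described above.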