Questions: Cross-Validation and Out-of-Sample Model Evaluation
5 questions to test your understanding
Question 1 Multiple Choice
A researcher compares two models: Model A has 3 predictors and in-sample R² = 0.73. Model B has 25 predictors and R² = 0.91. Model B's 10-fold cross-validation error is 60% higher than Model A's. For a forecasting application, which model should they choose?
A. Model B: higher R² always indicates a better-fitting, more accurate model
B. Model A: lower CV error means it generalizes better to new data
C. Model B: more predictors capture more real variation in the data
D. It depends on which individual coefficients are statistically significant
Answer: B
Model B's high R² likely reflects overfitting: with 25 predictors, it has fit noise in the training sample rather than the true underlying pattern. A 60% higher CV error means Model B performs much worse on data it hasn't seen. For forecasting, cross-validation error, not in-sample R², is the right performance metric. This is the central lesson: R² measures how well the model describes the data it was fit to; CV error estimates how well it predicts new data.
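The scenario in Question 1 can be reproduced on synthetic data. The following is a minimal sketch, assuming NumPy and an invented data-generating process (40 observations, 3 predictors that truly drive the outcome, 22 pure-noise predictors). The helper names `fit_ols`, `r_squared`, and `kfold_mse` are illustrative and do not come from any library.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients; X must already contain an intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def r_squared(X, y):
    """In-sample R²: fit and evaluate on the same observations."""
    resid = y - X @ fit_ols(X, y)
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def kfold_mse(X, y, k=10, seed=0):
    """Mean squared prediction error over k held-out folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    sq_errors = []
    for held_out in np.array_split(idx, k):
        train = np.setdiff1d(idx, held_out)      # every row not in this fold
        beta = fit_ols(X[train], y[train])       # estimate without the fold
        sq_errors.append((y[held_out] - X[held_out] @ beta) ** 2)
    return float(np.mean(np.concatenate(sq_errors)))

# Synthetic data (an assumption for illustration, not the researcher's data).
rng = np.random.default_rng(42)
n = 40
signal = rng.normal(size=(n, 3))     # predictors that truly drive y
noise = rng.normal(size=(n, 22))     # predictors unrelated to y
y = signal @ np.array([1.0, -0.5, 0.8]) + rng.normal(size=n)

ones = np.ones((n, 1))
X_a = np.hstack([ones, signal])          # "Model A": 3 predictors
X_b = np.hstack([ones, signal, noise])   # "Model B": 25 predictors

r2_a, r2_b = r_squared(X_a, y), r_squared(X_b, y)
cv_a, cv_b = kfold_mse(X_a, y), kfold_mse(X_b, y)
print(f"Model A: R2 = {r2_a:.2f}, 10-fold CV MSE = {cv_a:.2f}")
print(f"Model B: R2 = {r2_b:.2f}, 10-fold CV MSE = {cv_b:.2f}")
```

Because Model A's predictors are a subset of Model B's, Model B's in-sample R² cannot come out lower. With so few observations per coefficient, its held-out error should come out clearly higher, which is the pattern the question describes.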
Question 2 Multiple Choice
What specific problem does cross-validation detect that in-sample R² cannot?
A. Whether the model's coefficients are statistically significant at the 5% level
B. Whether the model has overfitted, fitting noise specific to the sample rather than the true underlying pattern
C. Whether omitted variable bias is affecting the coefficient estimates
D. Whether the error terms satisfy the Gauss-Markov homoskedasticity assumption
Answer: B
In-sample R² is measured on the same data used to estimate the model. Adding any variable, even pure noise, never lowers R² and in practice almost always raises it, so R² rewards complexity regardless of whether that complexity reflects real signal. Cross-validation simulates out-of-sample prediction by withholding data during estimation: a model that has overfit will fit the training folds closely but predict the held-out fold poorly, and this shows up as high CV error. R² cannot detect this because it never tests the model on data it wasn't trained on.
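The claim that extra variables cannot lower in-sample R² can be checked directly. The sketch below assumes NumPy and invented data: it starts from a one-predictor regression and appends ten columns of pure noise, recording R² after each addition. The name `r2_path` is illustrative.

```python
import numpy as np

def r_squared(X, y):
    """In-sample R² of an OLS fit; X must contain an intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

# Synthetic data (an assumption for illustration): one real predictor.
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
r2_path = [r_squared(X, y)]

# Append ten predictors that are unrelated to y by construction.
for _ in range(10):
    X = np.column_stack([X, rng.normal(size=n)])
    r2_path.append(r_squared(X, y))

for extra, r2 in enumerate(r2_path):
    print(f"{extra:2d} noise predictors: R2 = {r2:.4f}")
```

Each added column gives least squares one more direction in which to reduce the residual sum of squares, so the sequence never decreases even though none of the new predictors carries information about `y`.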
Question 3 True / False
Adding more predictor variables to a regression usually improves out-of-sample predictive performance because additional variables cannot reduce the model's explanatory power.
T. True
F. False
Answer: False
This confuses in-sample and out-of-sample performance. Adding variables always improves (or at worst maintains) in-sample R², because more parameters give the model more flexibility to fit the existing data. But out-of-sample, additional variables can hurt by fitting noise specific to the training sample — when the model encounters new data where that noise pattern doesn't repeat, its predictions worsen. Cross-validation reveals this by actually measuring performance on held-out data.
Question 4 True / False
A model selected by minimizing cross-validation error will typically outperform a model selected by maximizing in-sample R² when making predictions on new data.
T. True
F. False
Answer: True
This is precisely the purpose of cross-validation. Maximizing in-sample R² tends to select overly complex models that fit the sample's idiosyncrasies. Minimizing CV error selects models that perform well on data they haven't seen — which is the definition of good generalization. The two selection criteria agree only when models don't overfit; when they disagree, CV error is the more reliable guide for prediction tasks.
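The two selection rules can be compared side by side. This is a minimal sketch, assuming NumPy and an invented set of nested candidate models (2 real predictors plus 0, 5, 10, 20, or 30 noise predictors); `pick_r2` and `pick_cv` are illustrative names.

```python
import numpy as np

def fit_ols(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def r_squared(X, y):
    """In-sample R² (fit and evaluated on the same rows)."""
    resid = y - X @ fit_ols(X, y)
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def kfold_mse(X, y, k=10, seed=0):
    """Mean squared prediction error over k held-out folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    sq_errors = []
    for held_out in np.array_split(idx, k):
        train = np.setdiff1d(idx, held_out)
        beta = fit_ols(X[train], y[train])
        sq_errors.append((y[held_out] - X[held_out] @ beta) ** 2)
    return float(np.mean(np.concatenate(sq_errors)))

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(7)
n = 60
signal = rng.normal(size=(n, 2))
noise = rng.normal(size=(n, 30))
y = signal @ np.array([1.5, -1.0]) + rng.normal(size=n)

# Candidate models: the 2 real predictors plus the first m noise columns.
noise_counts = [0, 5, 10, 20, 30]
designs = [np.hstack([np.ones((n, 1)), signal, noise[:, :m]]) for m in noise_counts]

r2 = [r_squared(X, y) for X in designs]
cv = [kfold_mse(X, y) for X in designs]

pick_r2 = noise_counts[int(np.argmax(r2))]   # rule 1: maximize in-sample R²
pick_cv = noise_counts[int(np.argmin(cv))]   # rule 2: minimize CV error
print("Noise predictors in model chosen by R2:", pick_r2)
print("Noise predictors in model chosen by CV:", pick_cv)
```

Since the candidates are nested, maximizing R² selects the largest model by construction, while the CV rule is free to stop at a smaller one.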
Question 5 Short Answer
Explain why in-sample R² is a misleading measure of a model's predictive quality, and what cross-validation reveals instead.
Think about your answer before reading the model answer below.
Model answer: R² is measured on the same data used to fit the model, so it rewards complexity: an additional variable never lowers R² and almost always raises it, even if the variable is pure noise. A linear model with as many parameters as observations generally achieves R² = 1.0, yet it can predict new data poorly, sometimes worse than simply guessing the sample mean. Cross-validation simulates out-of-sample prediction by holding out portions of the data during estimation and measuring error on what the model never saw. This penalizes excess complexity automatically: overfit models perform well on training folds but fail on held-out folds.
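The extreme case of a model with as many parameters as observations is easy to simulate. The sketch below assumes NumPy and invented data in which the outcome is pure noise, so there is nothing real to learn; the naive benchmark (predicting the training mean) is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 15

# Training data: y is pure noise, unrelated to every column of X.
# X has n rows and n columns (intercept + n - 1 predictors).
X_train = np.hstack([np.ones((n, 1)), rng.normal(size=(n, n - 1))])
y_train = rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
resid = y_train - X_train @ beta
r2_in_sample = 1.0 - (resid @ resid) / np.sum((y_train - y_train.mean()) ** 2)

# Fresh observations drawn from the same process.
m = 1000
X_new = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n - 1))])
y_new = rng.normal(size=m)

mse_model = float(np.mean((y_new - X_new @ beta) ** 2))      # saturated model
mse_mean = float(np.mean((y_new - y_train.mean()) ** 2))     # naive benchmark

print(f"In-sample R2:            {r2_in_sample:.6f}")
print(f"New-data MSE, model:     {mse_model:.2f}")
print(f"New-data MSE, mean only: {mse_mean:.2f}")
```

With a square, full-rank design matrix the fitted values reproduce the training outcomes exactly, so in-sample R² is 1 up to rounding, even though the coefficients encode nothing but the noise in that particular sample.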
The key distinction is what each metric measures. R² answers: 'How well does the model describe this data?' CV error answers: 'How well will the model predict data it hasn't seen?' For forecasting, the second question is the one that matters. This is why modern machine learning and econometric forecasting practice selects models by CV error, or by criteria such as AIC and BIC that add an explicit complexity penalty to in-sample fit, rather than by R².
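For readers who want to see how AIC and BIC penalize complexity, here is a minimal sketch assuming NumPy, an OLS model with Gaussian errors, and invented data. It uses the common forms n·ln(RSS/n) + 2k and n·ln(RSS/n) + k·ln(n), where k is the number of fitted coefficients; additive constants shared by all models fit to the same outcome are dropped, and `gaussian_ic` is an illustrative helper name.

```python
import numpy as np

def gaussian_ic(X, y):
    """Return (AIC, BIC) for an OLS fit, up to a constant shared across models."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    fit_term = n * np.log(rss / n)          # falls as in-sample fit improves
    aic = fit_term + 2 * k                  # penalty: 2 per coefficient
    bic = fit_term + k * np.log(n)          # penalty: ln(n) per coefficient
    return aic, bic

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(3)
n = 200
signal = rng.normal(size=(n, 3))     # predictors that truly drive y
noise = rng.normal(size=(n, 22))     # predictors unrelated to y
y = signal @ np.array([1.0, -0.5, 0.8]) + rng.normal(size=n)

ones = np.ones((n, 1))
aic_a, bic_a = gaussian_ic(np.hstack([ones, signal]), y)          # 3 predictors
aic_b, bic_b = gaussian_ic(np.hstack([ones, signal, noise]), y)   # 25 predictors

print(f"Small model: AIC = {aic_a:.1f}, BIC = {bic_a:.1f}")
print(f"Large model: AIC = {aic_b:.1f}, BIC = {bic_b:.1f}")
```

Lower values are preferred for both criteria. Unlike cross-validation, neither one holds out data: they start from in-sample fit and add a penalty that grows with the number of coefficients.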