You use 10-fold cross-validation to choose between model A (CV error: 5%) and model B (CV error: 4%). You select model B and report its 4% cross-validated error as your final model's performance. What is wrong with this workflow?
A. Nothing; 10-fold CV gives the best possible performance estimate
B. You should have used leave-one-out CV instead of 10-fold
C. The final model should be retrained on all data after hyperparameter selection, and reporting CV error as final performance conflates model selection with model evaluation
D. Cross-validation can only be used for binary classification, not regression
Cross-validation is used here to choose between candidates by estimating which one generalizes best, but the models trained during CV each used only a fraction of the data. The correct workflow is: (1) use CV to select the model or hyperparameters, then (2) retrain the final model on ALL available data using that choice. The winning CV error is also optimistic as a performance figure: 4% is the smaller of two noisy estimates, and it was picked precisely because it was smaller. Reporting it as the final model's performance conflates model selection with model evaluation and describes models you never actually deployed. An honest performance estimate needs data that played no part in the selection.
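A minimal sketch of the select-then-refit workflow, using only the Python standard library. The toy data, the two candidate models (`fit_mean`, `fit_line`), and the helpers `kfold_indices` and `cv_error` are hypothetical names chosen for illustration; they do not come from the quiz or from any particular library.

```python
import random

def kfold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_mean(xs, ys):
    """Candidate A: ignore x and always predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Candidate B: ordinary least-squares line through the training data."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

def cv_error(fit, xs, ys, k=10):
    """Mean squared error averaged over k held-out folds."""
    fold_errors = []
    for test in kfold_indices(len(xs), k):
        held_out = set(test)
        train = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        fold_errors.append(
            sum((model(xs[i]) - ys[i]) ** 2 for i in test) / len(test)
        )
    return sum(fold_errors) / k

# Toy data: a noisy line (assumed purely for illustration).
rng = random.Random(42)
xs = [i / 10 for i in range(100)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]

# Step 1: CV is used only to choose between the candidates.
candidates = {"mean": fit_mean, "line": fit_line}
scores = {name: cv_error(fit, xs, ys) for name, fit in candidates.items()}
best = min(scores, key=scores.get)

# Step 2: refit the winner on all the data; this is the model that ships.
final_model = candidates[best](xs, ys)
```

Note that `scores[best]` is the minimum of several noisy estimates, so the sketch deliberately does not treat it as the performance of `final_model`; that figure would have to come from data not used in the selection.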
Question 2 Multiple Choice
For time-series data, why can't you use standard k-fold cross-validation where folds are created by random sampling?
A. Time-series data always has too few observations for k-fold to work
B. Random folds may train on future data to predict past data, violating causal ordering and inflating performance estimates
C. Time-series variables are too correlated across time for cross-validation to reduce variance
D. Standard k-fold assumes independent observations, which is violated, but this only affects computational efficiency
In time-series problems, a model must not be trained on future values to predict past ones. Doing so is data leakage: it makes the model look far better than it will perform on genuinely unseen future data. Standard k-fold randomly assigns each observation to a fold without respect to time, so a model might 'train' on 2023 data to predict 2022 observations. Time-series splits (expanding window or sliding window) enforce that training data always precedes test data, giving honest estimates of forward-looking performance.
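The expanding-window scheme described above can be sketched in a few lines of plain Python. The function name `expanding_window_splits` and its parameters are hypothetical, chosen for this illustration; scikit-learn's `TimeSeriesSplit` offers a comparable ready-made splitter.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_idx, test_idx) pairs in which every training index
    precedes every test index. The training window grows with each split."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for s in range(n_splits):
        test_start = first_test_start + s * test_size
        train_idx = list(range(test_start))  # everything before the test window
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx

# Observations are assumed to be sorted by time, oldest first.
splits = list(expanding_window_splits(n_samples=12, n_splits=3, test_size=2))
for train_idx, test_idx in splits:
    print("train", train_idx[0], "to", train_idx[-1], "| test", test_idx)
```

With these settings the test windows are `[6, 7]`, `[8, 9]`, and `[10, 11]`, and each training set ends just before its test window begins. A sliding-window variant would cap the length of `train_idx` instead of letting it grow.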
Question 3 True / False
Increasing k in k-fold cross-validation generally produces better (lower-variance) performance estimates.
T. True
F. False
Answer: False
False. Increasing k involves its own bias-variance tradeoff for the error estimate. Large k means each fold trains on nearly all the data, reducing bias in the error estimate. But the k training sets become highly overlapping, making the individual fold estimates highly correlated — this increases the variance of the average. Very large k can produce a higher-variance error estimate than moderate k. k = 5 or k = 10 is a well-established practical sweet spot, not the largest k possible.
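The effect of correlated fold estimates can be written down under a simplifying assumption that is not part of the quiz itself: suppose each of the k fold errors $e_1, \dots, e_k$ has the same variance $\sigma^2$ and every pair has the same correlation $\rho$. Then the variance of their average is

$$
\operatorname{Var}\!\left(\frac{1}{k}\sum_{i=1}^{k} e_i\right)
= \frac{1}{k^2}\Bigl(k\,\sigma^2 + k(k-1)\,\rho\,\sigma^2\Bigr)
= \frac{\sigma^2}{k} + \frac{k-1}{k}\,\rho\,\sigma^2 .
$$

Only the first term shrinks as k grows. With $\rho > 0$ the second term approaches $\rho\,\sigma^2$, so averaging more folds cannot push the variance below that floor. In practice both $\sigma^2$ and $\rho$ also change with k (smaller test folds are noisier, and more heavily overlapping training sets are more correlated), so this should be read as a qualitative guide rather than an exact formula.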
Question 4 True / False
Cross-validation can provide an unbiased estimate of model performance even when the same data is used for both hyperparameter tuning and error reporting.
T. True
F. False
Answer: False
When cross-validation is used to tune hyperparameters, the CV error is optimistically biased if also reported as the final performance estimate — because the hyperparameters were chosen to minimize that very error. This is 'double dipping.' To get an unbiased performance estimate, a held-out test set (never used for tuning) is required, or nested cross-validation (outer loop for evaluation, inner loop for tuning) must be used.
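A minimal sketch of nested cross-validation, using only the Python standard library. The toy data, the 1-D k-nearest-neighbours regressor, and all function names (`make_folds`, `knn_predict`, `mse`, `cv_mse`, `nested_cv`) are assumptions made for illustration, as are the grid values and fold counts.

```python
import math
import random

def make_folds(indices, k, seed=0):
    """Shuffle the given indices and deal them into k disjoint folds."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def knn_predict(train, x, n_neighbors):
    """Average the targets of the n_neighbors training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:n_neighbors]
    return sum(y for _, y in nearest) / len(nearest)

def mse(train, test, n_neighbors):
    """Mean squared error of the k-NN regressor on the test points."""
    return sum((knn_predict(train, x, n_neighbors) - y) ** 2
               for x, y in test) / len(test)

def cv_mse(data, indices, n_neighbors, k):
    """Plain k-fold CV error of one hyperparameter value on data[indices]."""
    total = 0.0
    for test_idx in make_folds(indices, k):
        held_out = set(test_idx)
        train = [data[i] for i in indices if i not in held_out]
        test = [data[i] for i in test_idx]
        total += mse(train, test, n_neighbors)
    return total / k

def nested_cv(data, grid, outer_k=5, inner_k=4):
    """Outer loop evaluates; inner loop tunes on outer-training data only."""
    all_idx = list(range(len(data)))
    outer_scores = []
    for test_idx in make_folds(all_idx, outer_k, seed=1):
        held_out = set(test_idx)
        train_idx = [i for i in all_idx if i not in held_out]
        # Tuning never sees the outer test fold.
        best = min(grid, key=lambda g: cv_mse(data, train_idx, g, inner_k))
        train = [data[i] for i in train_idx]
        test = [data[i] for i in test_idx]
        outer_scores.append(mse(train, test, best))
    return sum(outer_scores) / outer_k

# Toy data (assumed for illustration): a noisy sine curve.
rng = random.Random(7)
data = [(x / 10, math.sin(x / 10) + rng.gauss(0, 0.2)) for x in range(80)]

estimate = nested_cv(data, grid=[1, 3, 5, 9])
print("nested-CV MSE estimate:", round(estimate, 3))
```

The value in `estimate` describes the whole procedure (tune, then fit), because every outer test fold is scored by a model whose hyperparameter was chosen without it. A simple held-out test set achieves the same separation with a single outer split.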
Question 5 Short Answer
Why does k-fold cross-validation produce a more reliable generalization error estimate than a single random train/test split?
Model answer: A single split depends on the particular random partition — a lucky or unlucky split can make the model look much better or worse than it truly is. k-fold averages k separate error estimates, each from a different test fold, which reduces the variance of the overall estimate. Every data point appears in exactly one test fold, so all the data contributes to evaluation rather than just a held-out subset. This averaging over multiple evaluations smooths out the noise from any single split.
The key is that a single split gives you one sample from the distribution of possible train/test splits; k-fold gives you k test folds and averages them. If the fold estimates were independent, the variance of the average would fall roughly as 1/k relative to a single fold. Because the training sets overlap, the estimates are positively correlated and the actual reduction is smaller. This matters especially in small datasets, where a single test set may be too small to give a reliable error estimate and random fluctuations in which examples end up in the test set dominate the result.