Cross-Validation Techniques

Graduate · Depth 65 in the knowledge graph
Unlocks 3 downstream topics
evaluation hyperparameter-tuning overfitting-prevention model-selection

Core Idea

Cross-validation partitions data into train/test folds to estimate generalization error and tune hyperparameters without wasting data on a separate validation set. Stratified k-fold preserves class distribution; time-series splits respect temporal order; cross-validation reduces variance in error estimates compared to a single train/test split.

How It's Best Learned

Implement k-fold cross-validation and observe how error estimates vary with the number of folds k (and therefore the fold size) and how the choice of folds affects hyperparameter selection.

Explainer

From your study of the bias-variance tradeoff, you know that a model's performance on training data is an optimistic estimate of how it will perform on unseen data. The naive solution is to hold out a separate test set, but this wastes precious data — in a dataset of 500 examples, reserving 100 for testing means training on only 400, which may yield a worse model. Cross-validation addresses this by systematically rotating which data serves as the test set, so every example is used for both training and evaluation.

In k-fold cross-validation, you partition the data into k folds of equal size, or as near to equal as the dataset allows. You train the model k times, each time holding out one fold as the test set and training on the remaining k−1 folds. The k test-fold error estimates are then averaged to produce a single performance metric. With k = 5, for example, each model trains on 80% of the data and tests on 20%, and every data point appears in exactly one test fold. This usually gives a more reliable error estimate than a single random split: every example contributes to the evaluation, so the result depends less on the luck of one particular partition. The k evaluations are not independent, however, because their training sets overlap, so averaging reduces variance by less than it would for k truly independent estimates.
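The procedure described above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the helper names (`k_fold_indices`, `cross_validate`), the toy mean-predictor model, and the synthetic data are all assumptions made for the example.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when n is not divisible by k.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def cross_validate(xs, ys, fit, predict, k=5, seed=0):
    """Return the list of per-fold mean squared errors."""
    fold_errors = []
    for train_idx, test_idx in k_fold_indices(len(xs), k, seed):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors = [(predict(model, xs[i]) - ys[i]) ** 2 for i in test_idx]
        fold_errors.append(sum(errors) / len(errors))
    return fold_errors

# Hypothetical toy model: always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def predict_mean(model, x):
    return model

# Synthetic data, made up for this example.
xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]

fold_errors = cross_validate(xs, ys, fit_mean, predict_mean, k=5)
cv_error = sum(fold_errors) / len(fold_errors)  # the single averaged metric
```

With 20 examples and k = 5, each model here trains on 16 examples and is tested on the remaining 4. Rerunning with a different `seed` changes the partition and shows how much the per-fold errors move around.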

The choice of k involves its own bias-variance tradeoff. Large k (approaching leave-one-out, where k = n) uses nearly all of the data for training, which reduces bias in the error estimate, but the k training sets overlap heavily, so the individual estimates are highly correlated and the variance of their average can increase. Small k (like k = 2) produces training sets that overlap less, but each model trains on much less data, so the estimate tends to overstate the error of a model trained on the full dataset. k = 5 or k = 10 has emerged as a practical default because it balances these concerns well.

Stratified k-fold ensures each fold approximately preserves the class distribution of the full dataset, which is important when classes are imbalanced: without stratification, a fold might by chance contain no examples of a rare class.

For time-series data, standard k-fold violates temporal ordering (it trains on future data to predict the past), so time-series splits use expanding or sliding windows that always train on past data and test on later data.
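The two specialised splitting schemes can be sketched as follows. Both functions are hypothetical helpers written for this illustration, and the labels and sizes are made up. The stratified version deals each class's examples round-robin across folds without shuffling, which a practical implementation would normally add. The time-series version shows only the expanding-window variant.

```python
from collections import defaultdict

def stratified_fold_assignment(labels, k):
    """Assign each example a fold number so class proportions stay roughly equal."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    fold_of = [None] * len(labels)
    for members in by_class.values():
        # Deal this class's examples round-robin across the k folds.
        for position, i in enumerate(members):
            fold_of[i] = position % k
    return fold_of

def expanding_window_splits(n, n_splits, test_size):
    """Yield (train_idx, test_idx) pairs in which training always precedes testing."""
    first_test_start = n - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough data for the requested splits")
    for s in range(n_splits):
        test_start = first_test_start + s * test_size
        train_idx = list(range(test_start))  # everything before the test window
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx

# Imbalanced toy labels: 5 rare examples among 50.
labels = ["rare"] * 5 + ["common"] * 45
fold_of = stratified_fold_assignment(labels, k=5)
rare_per_fold = [0] * 5
for i, fold in enumerate(fold_of):
    if labels[i] == "rare":
        rare_per_fold[fold] += 1

# Twelve time-ordered observations, three test windows of two points each.
time_splits = list(expanding_window_splits(n=12, n_splits=3, test_size=2))
```

In this toy setup every fold receives one rare example, and each time-series training window ends immediately before its test window begins.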

Cross-validation's most important application is model selection and hyperparameter tuning. When choosing between, say, a decision tree with max depth 5 and one with max depth 10, you cannot compare their training errors, because the deeper tree will almost always fit the training data at least as well. Instead, you compare their cross-validated errors, which estimate generalization performance. You select the hyperparameters that minimize cross-validated error, then retrain the final model on all available data using those hyperparameters. This workflow (cross-validate to select, then retrain on everything) extracts maximum value from limited data and guards against overfitting. One caveat: the winning configuration's cross-validated error was itself used for selection, so it is a slightly optimistic estimate; an unbiased figure requires data that played no part in selection, such as a final held-out test set or nested cross-validation.
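The select-then-retrain workflow can be sketched with a small self-contained example. It swaps the decision tree for a one-dimensional k-nearest-neighbours regressor, purely because that fits in a few lines. The model, the candidate values, and the synthetic data are all assumptions for illustration. Every candidate is scored on the same folds so that the comparison is like for like.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs; fold sizes differ by at most one."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, folds[i]

def knn_predict(train_x, train_y, x, n_neighbors):
    """Average the targets of the n_neighbors closest training points."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    nearest = order[:n_neighbors]
    return sum(train_y[i] for i in nearest) / len(nearest)

def cv_error(xs, ys, n_neighbors, k=5):
    """Mean squared error averaged over k folds for one hyperparameter value."""
    fold_errors = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        tx = [xs[i] for i in train_idx]
        ty = [ys[i] for i in train_idx]
        sq = [(knn_predict(tx, ty, xs[i], n_neighbors) - ys[i]) ** 2
              for i in test_idx]
        fold_errors.append(sum(sq) / len(sq))
    return sum(fold_errors) / k

# Synthetic noisy quadratic data, made up for this example.
rng = random.Random(42)
xs = [i / 10 for i in range(60)]
ys = [x * x + rng.gauss(0, 1.0) for x in xs]

# Step 1: score each candidate hyperparameter by cross-validated error.
candidates = [1, 3, 5, 9, 15]
scores = {m: cv_error(xs, ys, m) for m in candidates}

# Step 2: pick the candidate with the lowest cross-validated error.
best = min(scores, key=scores.get)

# Step 3: retrain on all data. For k-NN, "training" is just keeping every point.
def final_model(x):
    return knn_predict(xs, ys, x, best)
```

Note that `scores[best]` is the minimum over several noisy estimates, so it should not be reported as the final model's expected error without a separate check.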

Prerequisite Chain

Counting to 10 → Counting to 20 → Understanding Zero → The Number Zero → Counting to Five → One-to-One Correspondence → Combining Small Groups Within 5 → Addition Within 10 → Addition Within 20 → Two-Digit Addition Without Regrouping → Two-Digit Addition with Regrouping → Addition Within 100 → Repeated Addition as Multiplication → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Two-Digit by One-Digit Division → Division with Remainders → Remainders and Quotients in Division → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Reading and Writing Decimals → Comparing and Ordering Decimals → Adding and Subtracting Decimals → Multiplying Decimals → Dividing Decimals → Dividing Fractions → Mixed Number Arithmetic → Order of Operations → Integer Order of Operations → Variable Expressions → Combining Like Terms → One-Step Equations → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Literal Equations → Slope-Intercept Form → Point-Slope Form → Writing Linear Equations → Parallel and Perpendicular Line Slopes → Graphing Linear Equations → Piecewise Functions → Step Functions → Composition of Functions → Inverse Functions → Radical Functions and Graphs → Rational Exponents → Exponential Functions and Graphs → Geometric Sequences and Series → Sigma Notation → Expected Value → Variance and Standard Deviation of Random Variables → Bias-Variance Tradeoff → Cross-Validation Techniques

Longest path: 66 steps · 357 total prerequisite topics

Prerequisites (5)

Leads To (1)