A data scientist computes mean and standard deviation from the full dataset (train + test combined), then splits into train/test sets and applies that scaler to both. What is the problem with this approach?
A. The scaler will produce different ranges for training and test data, causing model instability
B. Information from the test set leaks into the training process, producing overly optimistic performance estimates that won't hold on truly unseen data
C. Standardization cannot be applied after a train/test split; the data must be split first, then scaled separately with different scalers
D. This approach is fine as long as the test set is large enough to represent the population
Answer: B
Computing scaling parameters (mean, standard deviation) from the combined dataset leaks test-set information into the training process — this is data leakage. The model indirectly 'sees' the test set through the scaling parameters, producing performance estimates that look better than they would on truly unseen data. The correct approach: fit the scaler on training data only, then use its stored parameters to transform both training and test sets. This simulates the real deployment scenario where test data is unavailable during training.
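A minimal sketch of the fit-on-train, transform-both pattern, using only the Python standard library. The helper names `fit_scaler` and `transform` and the sample values are hypothetical, chosen for illustration. In scikit-learn the same pattern corresponds to calling `fit` on the training data and `transform` on both sets.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Learn the mean and standard deviation from the given values only."""
    return mean(values), pstdev(values)

def transform(values, params):
    """Standardize values using previously fitted parameters."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]

# Hypothetical single-feature data that has already been split.
train = [10.0, 12.0, 14.0, 16.0, 18.0]
test = [30.0, 40.0]

# Correct: parameters come from the training data alone.
params = fit_scaler(train)
train_scaled = transform(train, params)
test_scaled = transform(test, params)  # reuses the training parameters

# The mistake described in the question: parameters fitted on train + test.
leaky_params = fit_scaler(train + test)

print("train-only mean:", params[0])
print("leaky mean:     ", leaky_params[0])
```

The two means differ because the test values shift the combined statistics, which is exactly the information that should not reach the training process.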
Question 2 Multiple Choice
Which type of machine learning model is LEAST sensitive to whether features are scaled?
A. K-nearest neighbors (KNN)
B. Support vector machine with RBF kernel
C. Logistic regression
D. Random forest
Answer: D
Tree-based models like random forest split features at threshold values — the split at 'income > 50,000' is equivalent to 'scaled_income > 0.3' in terms of which samples it separates. The absolute scale doesn't change which split is optimal. Distance-based models (KNN, SVM with RBF kernel) compute distances between data points, so a feature with a large magnitude dominates distances and scaling is critical. Gradient-based models (logistic regression, neural networks) are also sensitive because large-magnitude features create steep loss-surface dimensions that impede convergence.
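The invariance of threshold splits can be checked directly. The sketch below uses hypothetical income values and a hypothetical `partition` helper. It shows that every candidate threshold on the raw feature has a counterpart on the min-max scaled feature that separates exactly the same samples, so any impurity score computed over those partitions is unchanged.

```python
def partition(values, threshold):
    """Return the indices of samples on the left side of a split (value <= threshold)."""
    return {i for i, v in enumerate(values) if v <= threshold}

# Hypothetical raw feature values.
income = [20_000, 35_000, 50_000, 80_000, 120_000]
lo, hi = min(income), max(income)

def scale(v):
    """Min-max scaling: a monotonic transform of the feature."""
    return (v - lo) / (hi - lo)

scaled_income = [scale(v) for v in income]

# The split from the explanation, on raw and on scaled values.
raw_left = partition(income, 50_000)
scaled_left = partition(scaled_income, scale(50_000))

# Every candidate threshold yields the same partition before and after scaling.
all_match = all(
    partition(income, t) == partition(scaled_income, scale(t)) for t in income
)
print(raw_left == scaled_left, all_match)
```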
Question 3 True / False
Scaling must be applied inside each cross-validation fold — computing the scaler on the full training set before splitting into folds leaks information from each validation fold.
T. True
F. False
Answer: True
Cross-validation simulates evaluating on unseen data by holding out each fold as a validation set. If you fit the scaler on all training data before splitting into folds, the validation fold's statistics (mean, std) influence the scaling applied to it — this is a form of data leakage that makes cross-validation estimates optimistic. Proper practice: inside each fold, fit the scaler on the training portion of that fold only, then transform both portions using those parameters. This correctly simulates the scenario where each validation fold is truly unseen.
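Per-fold scaling can be sketched with a hand-written fold loop. The `k_fold_indices` helper and the data values are hypothetical and stand in for whatever splitter and features are actually used. The key point is that the mean and standard deviation are recomputed from the training portion of each fold.

```python
from statistics import mean, pstdev

def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    fold_size = n // k
    for f in range(k):
        start = f * fold_size
        stop = n if f == k - 1 else start + fold_size
        val_idx = list(range(start, stop))
        train_idx = [i for i in range(n) if i < start or i >= stop]
        yield train_idx, val_idx

data = [3.0, 5.0, 7.0, 9.0, 11.0, 100.0]  # hypothetical feature values

fold_params = []
for train_idx, val_idx in k_fold_indices(len(data), k=3):
    train_part = [data[i] for i in train_idx]
    val_part = [data[i] for i in val_idx]

    # Fit on the training portion of this fold only.
    mu, sigma = mean(train_part), pstdev(train_part)

    # Transform both portions with those parameters.
    train_scaled = [(v - mu) / sigma for v in train_part]
    val_scaled = [(v - mu) / sigma for v in val_part]

    fold_params.append((mu, sigma))

for i, (mu, sigma) in enumerate(fold_params):
    print(f"fold {i}: mean={mu:.2f}, std={sigma:.2f}")
```

Each fold ends up with its own parameters. In scikit-learn, placing the scaler and the model in a `Pipeline` and passing that pipeline to `cross_val_score` gives the same per-fold fitting without a manual loop.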
Question 4 True / False
Min-max normalization is preferred over standardization when the dataset contains significant outliers, because it compresses outliers into the [0, 1] range and prevents them from distorting the scaling.
T. True
F. False
Answer: False
This reverses the actual guidance. Min-max normalization is *more* sensitive to outliers, not less. A single outlier sets the min or the max, so all other values are compressed into a small portion of [0, 1]. Standardization (z-score) is generally the safer choice when outliers are present: the output is not pinned to a fixed range, so an outlier simply becomes a large z-score instead of defining the endpoints of the scale. It is not immune, though, because outliers also pull the mean and inflate the standard deviation. Min-max normalization is better when the data is already bounded and you need values constrained to a fixed range (e.g., neural network inputs expected in [0, 1] with no extreme values).
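A small numeric sketch of the compression effect, using hypothetical values with one extreme outlier. It measures how much of the [0, 1] range the five ordinary points occupy after min-max normalization, and reports the z-scores alongside for comparison. The outlier also inflates the mean and standard deviation used for the z-scores, so standardization is affected as well; it just has no fixed ceiling that the outlier defines.

```python
from statistics import mean, pstdev

# Hypothetical data: five ordinary points and one extreme outlier.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 1000.0]

# Min-max normalization: the outlier becomes the max and maps to 1.0.
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]

# Standardization: the outlier becomes a large z-score.
mu, sigma = mean(values), pstdev(values)
z_scores = [(v - mu) / sigma for v in values]

# Portion of [0, 1] occupied by the five ordinary points.
min_max_spread = max(min_max[:5]) - min(min_max[:5])

print("min-max:", [round(v, 4) for v in min_max])
print("z-score:", [round(v, 4) for v in z_scores])
print("ordinary points span", round(min_max_spread, 4), "of the [0, 1] range")
```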
Question 5 Short Answer
Why does failing to scale features harm distance-based algorithms like k-nearest neighbors, even when all features are genuinely important predictors?
Model answer: Distance-based algorithms compute geometric distance between data points to determine similarity. Without scaling, features with numerically large ranges dominate the distance calculation — a difference of 10,000 in income swamps a difference of 50 in age, even if both differences are equally meaningful predictively. The algorithm effectively ignores the small-range features entirely. Scaling puts all features on an equal numeric footing, so the distance metric reflects the actual similarity structure across all features rather than being hijacked by whichever feature happens to have the largest units.
This is especially problematic when features have different units (dollars, years, binary flags) where the numeric magnitude is arbitrary rather than meaningful. A feature measured in kilometers could be converted to meters and suddenly dominate all distances — yet nothing about the data has changed. Scaling ensures that algorithmic behavior depends on the actual information content of features, not their units of measurement. For gradient-based methods, the analogous problem is that unscaled features create an elongated, poorly conditioned loss surface that gradient descent navigates inefficiently.
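The dominance effect can be reproduced with a tiny nearest-neighbor lookup. The records, the query, and the helper names below are hypothetical. The sketch finds the closest record to a query on raw (income, age) values, then again after standardizing each column with statistics taken from the reference records.

```python
import math
from statistics import mean, pstdev

def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, points):
    """Index of the point closest to the query."""
    return min(range(len(points)), key=lambda i: euclidean(query, points[i]))

# Hypothetical (income in dollars, age in years) records.
points = [(30_000.0, 25.0), (32_000.0, 60.0), (90_000.0, 40.0)]
query = (31_800.0, 27.0)

# Raw distances: the income column dominates.
raw_nn = nearest(query, points)

# Standardize each column using statistics from the reference records.
stats = [(mean(col), pstdev(col)) for col in zip(*points)]

def scale(row):
    return tuple((v - mu) / sigma for v, (mu, sigma) in zip(row, stats))

scaled_nn = nearest(scale(query), [scale(p) for p in points])

print("nearest on raw values:   ", points[raw_nn])
print("nearest on scaled values:", points[scaled_nn])
```

On raw values the query matches the 60-year-old, because a 200-dollar income gap outweighs a 33-year age gap. After scaling it matches the 25-year-old, whose age and income are both close to the query.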