A data scientist computes feature importance scores using the entire dataset (training + test combined), selects the top 15 features, then trains and evaluates a model on the train/test split. The test accuracy looks excellent. What is the most likely problem with this workflow?
A. Using too many features always causes overfitting, regardless of how they were selected
B. Feature importance scores computed on the full dataset leak test set information into the selection step, inflating performance estimates that will not hold on truly unseen data
C. The feature selection step should always come after model evaluation, not before
D. Importance scores are only valid for tree-based models, not other algorithms
Answer: B
This is data leakage through feature selection. When you compute importance or correlation scores on the full dataset, the test set's patterns influence which features are chosen. The model then appears to perform well on a test set that was already used (indirectly) to select its inputs. On genuinely unseen data, performance will be worse. The correct procedure is to fit all preprocessing steps — including feature selection — using only training data, then apply the selection to the test set without re-fitting. This is one of the most common sources of inflated results in applied ML.
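The effect described above can be reproduced on pure-noise data, where no feature has any real relationship to the label. The following is a minimal sketch using NumPy only; the sample sizes, the correlation-based selector, and the nearest-centroid classifier are illustrative assumptions, not part of the original question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pure-noise data: no feature carries real signal about the label,
# so an honest accuracy estimate should sit near chance (about 0.5).
n_samples, n_features, k = 200, 5000, 20
X = rng.standard_normal((n_samples, n_features))
y = rng.integers(0, 2, size=n_samples)

# Fixed split: first half is training data, second half is the holdout.
train_idx = np.arange(0, 100)
holdout_idx = np.arange(100, 200)


def top_k_by_correlation(X_part, y_part, k):
    """Indices of the k features with the largest absolute correlation with y."""
    Xc = X_part - X_part.mean(axis=0)
    yc = y_part - y_part.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    )
    return np.argsort(-np.abs(corr))[:k]


def nearest_centroid_accuracy(X_tr, y_tr, X_ho, y_ho):
    """Fit class centroids on training rows, then score on holdout rows."""
    c0 = X_tr[y_tr == 0].mean(axis=0)
    c1 = X_tr[y_tr == 1].mean(axis=0)
    d0 = ((X_ho - c0) ** 2).sum(axis=1)
    d1 = ((X_ho - c1) ** 2).sum(axis=1)
    pred = (d1 < d0).astype(int)
    return float((pred == y_ho).mean())


# Leaky workflow: features are selected using ALL rows, holdout included.
leaky_idx = top_k_by_correlation(X, y, k)
leaky_acc = nearest_centroid_accuracy(
    X[train_idx][:, leaky_idx], y[train_idx],
    X[holdout_idx][:, leaky_idx], y[holdout_idx],
)

# Honest workflow: features are selected using training rows only.
honest_idx = top_k_by_correlation(X[train_idx], y[train_idx], k)
honest_acc = nearest_centroid_accuracy(
    X[train_idx][:, honest_idx], y[train_idx],
    X[holdout_idx][:, honest_idx], y[holdout_idx],
)

print(f"leaky selection accuracy:  {leaky_acc:.2f}")
print(f"honest selection accuracy: {honest_acc:.2f}")
```

Because the data are noise, any accuracy well above chance in the leaky workflow is an artifact of selecting features that happened to correlate with the holdout labels.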
Question 2 Multiple Choice
You are building a model with hundreds of candidate features and cannot afford to repeatedly train the full model for wrapper-based selection. Which selection method is most appropriate, and what is its main limitation?
A. Embedded methods like Lasso, but they require the target variable to be continuous
B. Wrapper methods like forward selection, but they are computationally cheap and always preferred
C. Filter methods using statistical tests (correlation, mutual information), but they evaluate features independently and miss interaction effects between features
D. Domain knowledge alone, because algorithmic selection is only valid for large datasets
Answer: C
Filter methods score each feature individually against the target using statistical measures, without training the actual model. This makes them fast and scalable, which is exactly what you need when repeated full-model training is too expensive. Their key limitation is that they evaluate features in isolation: a feature that is useless alone but powerful in combination with another feature (an interaction effect) will be missed. Embedded methods like Lasso weigh features jointly during fitting, which handles redundancy better, but they require model training (and a linear model such as Lasso still needs explicit interaction terms to capture interactions). Wrapper methods can capture interactions but are computationally prohibitive at scale.
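The interaction blind spot can be illustrated with an XOR-style target. In this sketch, the filter score is assumed to be the absolute Pearson correlation, and the features `a` and `b` are hypothetical; other univariate scores behave similarly on this kind of data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Two features coded as -1/+1. The target is their product, so it depends
# only on whether the two features agree (an XOR-style interaction).
a = rng.choice([-1, 1], size=n)
b = rng.choice([-1, 1], size=n)
y = a * b


def filter_score(feature, target):
    """Univariate filter score: absolute Pearson correlation with the target."""
    return abs(np.corrcoef(feature, target)[0, 1])


score_a = filter_score(a, y)        # each feature alone looks uninformative
score_b = filter_score(b, y)
score_ab = filter_score(a * b, y)   # the explicit interaction term is decisive

print(f"a alone: {score_a:.3f}, b alone: {score_b:.3f}, a*b: {score_ab:.3f}")
```

A univariate filter would rank `a` and `b` near the bottom and could discard both, even though together they determine the target completely.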
Question 3 True / False
Adding more features to a model generally improves performance because the model can typically learn to ignore features that are irrelevant.
True
False
Answer: False
This misconception overlooks the curse of dimensionality. While some models (such as L1-regularized models) can suppress irrelevant features, in practice irrelevant features add noise that models may overfit to, especially with limited training data. Redundant, correlated features waste model capacity. High dimensionality also enlarges the space the model must search, which degrades generalization. Feature selection is valuable because fewer, better features typically lead to simpler, more generalizable models, not because models cannot handle many features in theory.
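A small simulation makes the cost of irrelevant features concrete. This is a sketch under assumed settings (60 training rows, 5 informative features, 50 pure-noise features, ordinary least squares); the exact numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_holdout = 60, 1000
n_informative, n_noise = 5, 50
true_w = rng.standard_normal(n_informative)


def make_data(n):
    """Informative features drive y; noise features are unrelated to it."""
    X_inf = rng.standard_normal((n, n_informative))
    X_noise = rng.standard_normal((n, n_noise))
    y = X_inf @ true_w + rng.standard_normal(n)
    return X_inf, X_noise, y


def holdout_mse(X_tr, y_tr, X_ho, y_ho):
    """Fit least squares (with intercept) on training data, score on holdout."""
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr])
    A_ho = np.column_stack([np.ones(len(X_ho)), X_ho])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return float(np.mean((A_ho @ coef - y_ho) ** 2))


Xi_tr, Xn_tr, y_tr = make_data(n_train)
Xi_ho, Xn_ho, y_ho = make_data(n_holdout)

mse_informative = holdout_mse(Xi_tr, y_tr, Xi_ho, y_ho)
mse_with_noise = holdout_mse(
    np.hstack([Xi_tr, Xn_tr]), y_tr,
    np.hstack([Xi_ho, Xn_ho]), y_ho,
)

print(f"holdout MSE, informative features only: {mse_informative:.2f}")
print(f"holdout MSE, plus 50 noise features:    {mse_with_noise:.2f}")
```

With few training rows, the model fits the noise columns to the training targets, and the holdout error rises even though no information was removed.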
Question 4 True / False
Performing feature selection using only training data, then applying the same selection to the test set, is a valid and complete safeguard against data leakage in feature selection.
True
False
Answer: True
This is the correct protocol. Feature selection (like all preprocessing steps) must be 'fit' on training data only — meaning the statistical scores, importance values, or regularization weights that determine which features are kept are computed using no test set information. The selected feature indices are then applied to the test set as a fixed transformation, without re-computing. This ensures the test set remains a true holdout that represents genuinely unseen data, giving unbiased performance estimates.
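One common way to enforce this protocol is to place the selector inside a pipeline so it is only ever fit on the data passed to `fit`. The sketch below assumes scikit-learn is available and uses a synthetic dataset, `SelectKBest` with `f_classif`, and logistic regression as stand-ins; none of these choices come from the original question.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data: 100 candidate features, 10 of them informative.
X, y = make_classification(
    n_samples=400, n_features=100, n_informative=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The selector is a pipeline step, so it is fit only on the rows given to fit().
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=15)),
    ("model", LogisticRegression(max_iter=1000)),
])

# Inside cross-validation the selector is re-fit on each training fold,
# so validation folds never influence which features are kept.
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)

# Final fit on the training split; the test split is only transformed and scored.
pipe.fit(X_train, y_train)
selected_mask = pipe.named_steps["select"].get_support()
holdout_accuracy = pipe.score(X_test, y_test)

print(f"features kept: {int(selected_mask.sum())}")
print(f"mean CV accuracy: {cv_scores.mean():.3f}")
print(f"holdout accuracy: {holdout_accuracy:.3f}")
```

Keeping selection inside the pipeline also covers cross-validation, where a selector fit once on the whole training set would leak validation-fold information in the same way.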
Question 5 Short Answer
Why does feature engineering often matter more than algorithm choice in applied machine learning, and what is the guiding question when deciding whether to create a new feature?
Model answer: Feature engineering matters more than algorithm choice because models learn patterns that are explicitly present in their input representation. A complex algorithm cannot discover a relationship that the features do not expose — but a simple model on well-engineered features can outperform a sophisticated model on raw data because the key pattern is already visible. The guiding question when creating a new feature is: 'What transformation would make the pattern I expect to find linearly separable or more obvious to the model?' Features should encode domain knowledge directly, making implicit structure explicit.
This insight cuts against the common impulse to try more powerful algorithms first. The bottleneck is usually the representation, not the model capacity. If 'age' matters nonlinearly (very young and very old both having high risk), adding a squared term (age² alongside age) exposes that U-shaped relationship to even a linear model; replacing age with age² alone would not, because age² is still monotonic for positive ages. If the ratio of two quantities matters more than either individually, create that ratio explicitly. Algorithmic improvements are bounded by the information the features expose; feature engineering raises that ceiling.
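The age example can be sketched numerically. The data below are synthetic, with an assumed U-shaped risk curve that is lowest at age 45; the curve, noise level, and the `r_squared` helper are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical U-shaped risk: high for the very young and the very old.
age = rng.uniform(0, 90, size=n)
risk = ((age - 45) / 45) ** 2 + rng.normal(0, 0.05, size=n)


def r_squared(features, target):
    """R^2 of a least-squares linear fit (with intercept) on the given columns."""
    A = np.column_stack([np.ones(len(target))] + features)
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    return float(1 - resid.var() / target.var())


r2_raw = r_squared([age], risk)                   # linear model on raw age
r2_engineered = r_squared([age, age ** 2], risk)  # same model, plus age squared

print(f"R^2 with age only:        {r2_raw:.3f}")
print(f"R^2 with age and age^2:   {r2_engineered:.3f}")
```

The model class is identical in both fits; only the representation changes, which is the point of the passage above.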