A single deep decision tree achieves 100% accuracy on training data but only 70% on a held-out test set. A random forest of 500 trees achieves 93% training accuracy and 89% test accuracy. What best explains why the forest outperforms the single tree on test data?
A. The forest uses more training data because each tree sees a bootstrap sample larger than the original dataset
B. Each tree in the forest is shallower and therefore has higher bias, which generalizes better
C. The forest averages many high-variance, decorrelated trees, reducing overall variance while preserving low bias
D. The forest eliminates all irrelevant features, leaving only the most predictive ones
Answer: C
The single tree has high variance: it memorized the training data, including its noise. A random forest reduces variance by averaging many trees whose errors are only weakly correlated, because bootstrap sampling and random feature subsets make each tree err in different places. Averaging weakly correlated errors largely cancels them while preserving the shared signal. Option A is wrong: each bootstrap sample has the same size as the original data and, because it is drawn with replacement, contains fewer distinct examples. Option B is wrong: the trees in a forest are still grown deep and keep their low bias; the gain comes from variance reduction, not from added bias. Option D is wrong: a forest does not remove features; irrelevant ones simply tend to be selected for splits less often.
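The gap between a single deep tree and a forest can be reproduced in a few lines. The following is a minimal sketch, assuming scikit-learn is available; the synthetic dataset, its noise level, and the forest size of 200 are illustrative assumptions, not the data behind the numbers in the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (an assumption for illustration): 10% of labels are flipped.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    flip_y=0.1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A fully grown tree: no depth limit, so it can memorize the training set.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# A forest of fully grown trees, each fit on a bootstrap sample
# with a random feature subset considered at every split.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

scores = {
    "tree_train": tree.score(X_train, y_train),
    "tree_test": tree.score(X_test, y_test),
    "forest_train": forest.score(X_train, y_train),
    "forest_test": forest.score(X_test, y_test),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

The exact accuracies depend on the data and seed; the pattern to look for is a perfect training score for the single tree paired with a noticeably lower test score than the forest.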
Question 2 Multiple Choice
What is the primary purpose of selecting only a random subset of features at each split in a random forest, rather than considering all features?
A. It speeds up training by reducing computation at each node
B. It forces each tree to use every feature at least once, ensuring full coverage
C. It decorrelates the trees so that their errors are independent and cancel when averaged
D. It prevents any single tree from overfitting by limiting its information access
Answer: C
The key insight is decorrelation. If every split could consider all features (even with different bootstrap samples), the strongest predictors would dominate the top splits of every tree, making the trees highly correlated: they would make the same mistakes on the same examples, and averaging would help little. Randomly restricting the candidate features at each split forces some trees to build their top splits around secondary predictors, producing diverse trees whose errors are only partially correlated. Averaging then cancels the independent part of the errors while the signal accumulates. Speed (option A) is a real side effect but not the primary purpose. Option B is wrong because random selection does not guarantee that a tree uses every feature, and option D misplaces the effect: individual trees still overfit, and the protection comes from averaging them.
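One way to see the decorrelation effect is to measure how similar the per-tree predictions are with and without feature subsampling. This is a rough sketch, assuming scikit-learn and NumPy; the helper `mean_tree_correlation`, the synthetic regression data, and the 30% feature fraction are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def mean_tree_correlation(forest, X):
    """Average pairwise Pearson correlation between per-tree predictions on X."""
    preds = np.array([est.predict(X) for est in forest.estimators_])
    corr = np.corrcoef(preds)  # (n_trees, n_trees) correlation matrix
    off_diagonal = corr[~np.eye(len(corr), dtype=bool)]
    return float(off_diagonal.mean())


# Synthetic data with 3 informative features out of 10 (illustrative assumption).
X, y = make_regression(
    n_samples=1000, n_features=10, n_informative=3, noise=20.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

correlations = {}
for label, max_features in [("all features", 1.0), ("30% of features", 0.3)]:
    # max_features=1.0 lets every split see all features (plain bagging).
    forest = RandomForestRegressor(
        n_estimators=50, max_features=max_features, random_state=0
    )
    forest.fit(X_train, y_train)
    correlations[label] = mean_tree_correlation(forest, X_test)
    print(f"{label}: mean tree-to-tree correlation = {correlations[label]:.3f}")
```

Lower correlation between trees is what makes averaging effective, although each restricted tree is usually a little less accurate on its own; the feature fraction trades these two effects against each other.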
Question 3 True / False
Adding more trees to a random forest will eventually cause it to overfit the training data, just as a single deep tree does.
T. True
F. False
Answer: False
False. This is a common misapplication of the intuition that 'more complexity = more overfitting.' Adding trees does not make the model more flexible; it only makes the average over trees more stable. The trees are identically distributed high-variance estimators, so the variance of their average shrinks as trees are added and levels off at a floor set by the correlation between trees; it does not grow. Training accuracy may stay high, but test accuracy plateaus rather than declining. This is in sharp contrast to a single tree, where more depth directly increases complexity and overfitting. A forest can still overfit noisy data if its individual trees are too deep, but a larger tree count is not the cause. The practical consequence is that the 'number of trees' hyperparameter is safe to set large; the only costs of 'too many' are training time, prediction time, and memory.
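The plateau can be illustrated with a toy simulation that uses only the Python standard library. It is a sketch under a simplifying assumption, not a real forest: each "tree error" is modeled as a Gaussian with variance 1 and a fixed pairwise correlation `rho`, built from one shared component plus one private component. The function name and parameter values are hypothetical.

```python
import random
import statistics


def variance_of_average(n_trees, rho, n_trials=2000, seed=0):
    """Empirical variance of the mean of n_trees equicorrelated unit-variance errors.

    Toy model (an assumption): error_i = sqrt(rho) * shared + sqrt(1 - rho) * private_i,
    so each error has variance 1 and each pair of errors has correlation rho.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_trials):
        shared = rng.gauss(0.0, 1.0)
        private_mean = statistics.fmean(rng.gauss(0.0, 1.0) for _ in range(n_trees))
        # Mean of the n_trees errors: the shared part does not average away.
        means.append(rho ** 0.5 * shared + (1 - rho) ** 0.5 * private_mean)
    return statistics.pvariance(means)


rho = 0.3
results = {n: variance_of_average(n, rho) for n in (1, 10, 100, 500)}
for n, var in results.items():
    expected = rho + (1 - rho) / n  # variance of the average under the toy model
    print(f"n_trees={n:4d}  simulated={var:.3f}  expected={expected:.3f}")
```

Under this model the variance falls quickly at first and then flattens near `rho`: extra trees stop helping, but they never push the variance back up.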
Question 4 True / False
Random forests preserve interpretability because you can inspect the individual trees and trace the decision path for any prediction.
T. True
F. False
Answer: False
False. A single decision tree is interpretable — you can follow the sequence of splits from root to leaf for any input. But a random forest aggregates hundreds or thousands of trees; no single decision path explains a prediction, and the ensemble vote is a black box. Feature importance scores (measuring average impurity reduction per feature across all trees) partially recover a sense of variable importance, but this is a summary statistic, not an explanation of individual predictions. This interpretability tradeoff is one of the main practical reasons to choose a single tree over a forest when transparency is required.
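The difference between a global summary and a per-prediction explanation can be seen directly in code. This is a brief sketch, assuming scikit-learn; the synthetic dataset and the forest size are illustrative assumptions.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic problem (an assumption for illustration).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, random_state=0
)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Global summary: one impurity-based score per feature, normalized to sum to 1.
importances = forest.feature_importances_
for i, score in enumerate(importances):
    print(f"feature {i}: {score:.3f}")

# Per-prediction view: every tree routes the same input down its own path.
# scikit-learn combines trees by averaging class probabilities; hard votes
# are tallied here only for readability.
sample = X[:1]
votes = Counter(int(est.predict(sample)[0]) for est in forest.estimators_)
print("per-tree votes for one input:", dict(votes))
```

The importance scores say which features matter on average, but the prediction for one input is the aggregate of 100 separate decision paths, none of which explains the result by itself.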
Question 5 Short Answer
Explain why averaging many decision trees reduces prediction error. What role does the 'random feature subset' step play, and what would happen if it were removed?
Model answer: Each individual tree has high variance — small changes in training data produce very different trees. When many high-variance estimators whose errors are uncorrelated are averaged, variance decreases (errors cancel) while bias is unchanged. The random feature subset step is what creates the decorrelation: without it, all trees would tend to put the strongest predictors at their root, producing highly correlated trees that make the same errors and gain little from averaging. With random feature subsets, trees are forced to differ structurally, making their errors more independent and the averaging more effective at noise reduction.
The statistical principle is that the variance of an average of n independent random variables with variance σ² is σ²/n, while the average of n perfectly correlated variables has the same variance σ². Real random forest trees are neither perfectly independent nor perfectly correlated — they fall somewhere in between, so variance reduction is real but not as large as if trees were independent. The random feature subset step pushes trees toward lower correlation by preventing any single dominant feature from structuring all trees identically.
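The in-between case can be written out explicitly. As a worked sketch, assume the n tree predictions T_1, ..., T_n are identically distributed with variance σ² and share a common pairwise correlation ρ (an idealization; real trees do not have exactly equal correlations):

```latex
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} T_i\right)
  = \frac{1}{n^2}\Bigl[\, n\sigma^2 + n(n-1)\rho\sigma^2 \Bigr]
  = \rho\,\sigma^2 + \frac{1-\rho}{n}\,\sigma^2
```

Setting ρ = 0 recovers σ²/n and setting ρ = 1 recovers σ², matching the two limiting cases above. As n grows, the second term vanishes and the variance approaches the floor ρσ², so lowering ρ is the only way to lower that floor. For example, with σ² = 1, ρ = 0.3, and n = 500, the variance is 0.3 + 0.7/500 = 0.3014.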