Both gradient boosting and random forests use ensembles of decision trees. What is the most fundamental architectural difference between the two methods?
B. Gradient boosting trains trees sequentially, each correcting the errors of the previous ensemble; random forests train trees independently in parallel and average their predictions
C. Gradient boosting reduces variance; random forests reduce bias
D. Random forests are restricted to squared-error loss; gradient boosting can use any loss function
The core distinction is sequential vs. parallel. Random forests use bagging: each tree is trained independently on a bootstrap sample (usually with a random subset of features considered at each split), and predictions are averaged, primarily reducing variance. Gradient boosting uses boosting: each tree is trained to correct the residuals (more generally, the negative gradient of the loss) of all previous trees, primarily reducing bias. Gradient boosting can also use any differentiable loss and has more hyperparameters to tune, but the sequential vs. parallel architecture is the defining difference from which everything else follows.
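The contrast can be sketched in a few lines of plain Python. This is a minimal illustration, not a faithful implementation of either method: it assumes one-dimensional inputs, uses decision stumps in place of full trees, and omits the feature subsampling of a real random forest. The names `fit_stump`, `random_forest_style`, and `gradient_boosting_style` are hypothetical helpers introduced here.

```python
import random

def fit_stump(xs, targets):
    """Least-squares stump on 1-D inputs: (threshold, left_value, right_value)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest x so the right side is never empty
        left = [v for x, v in zip(xs, targets) if x <= t]
        right = [v for x, v in zip(xs, targets) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - lv) ** 2 for v in left) + sum((v - rv) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    if best is None:  # every x identical: fall back to a constant prediction
        mean = sum(targets) / len(targets)
        return (xs[0], mean, mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def random_forest_style(xs, ys, n_trees, rng):
    """Bagging: every stump gets its own bootstrap sample; no stump depends on another."""
    stumps = []
    for _ in range(n_trees):  # iterations are independent, so they could run in parallel
        idx = [rng.randrange(len(xs)) for _ in xs]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(predict_stump(s, x) for s in stumps) / len(stumps)

def gradient_boosting_style(xs, ys, n_trees, learning_rate=0.5):
    """Boosting (squared error): every stump is fitted to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):  # each iteration needs the predictions of all earlier ones
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(predict_stump(s, x) for s in stumps)

def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]
forest = random_forest_style(xs, ys, n_trees=25, rng=random.Random(0))
boost_1 = gradient_boosting_style(xs, ys, n_trees=1)
boost_50 = gradient_boosting_style(xs, ys, n_trees=50)
print(mse(forest, xs, ys), mse(boost_1, xs, ys), mse(boost_50, xs, ys))
```

The structural point is in the two loops: the bagging loop never reads another stump's output, while the boosting loop cannot start iteration k until iteration k − 1 has updated `preds`.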
Question 2 Multiple Choice
When gradient boosting uses absolute error loss instead of squared error, each new tree is fitted to which target values?
A. The original target values, to ensure the tree sees the full signal
B. The negative gradient of the absolute error loss evaluated at each data point's current predicted value (the pseudo-residuals)
C. A bootstrap-reweighted sample with misclassified examples upweighted, as in AdaBoost
D. The Hessian of the loss function, enabling second-order optimization at each step
For squared error (written as ½(y − ŷ)²), the residual y − ŷ equals the negative gradient, so fitting residuals and fitting the negative gradient are identical in that special case. For any other loss function, gradient boosting fits trees to the negative gradient of the loss (the pseudo-residuals), not the raw residual; for absolute error, these are the signs of the errors, +1 or −1. This is why the method is called 'gradient' boosting and why it generalizes to classification, quantile regression, and ranking tasks: the pseudo-residuals adapt to whatever loss function is being minimized. AdaBoost reweights examples (C). XGBoost additionally uses the Hessian (D) as second-order information when scoring splits and setting leaf values, but trees are not fitted to the Hessian itself; the base algorithm fits the negative gradient.
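As a small worked example, the pseudo-residuals for the two losses can be computed directly. The sketch below assumes the squared-error loss is written with a factor of ½ and treats a tie (prediction equal to target) as a zero gradient for absolute error; the function names and the sample values are illustrative only.

```python
def negative_gradient_squared(y, pred):
    """Loss 0.5 * (y - pred)**2: the negative gradient is the plain residual."""
    return [yi - pi for yi, pi in zip(y, pred)]

def negative_gradient_absolute(y, pred):
    """Loss |y - pred|: the negative gradient is the sign of the residual (0 at a tie)."""
    return [float((yi > pi) - (yi < pi)) for yi, pi in zip(y, pred)]

def numeric_negative_gradient(loss, y, pred, eps=1e-6):
    """Central-difference check of -dLoss/dpred at each data point."""
    return [-(loss(yi, pi + eps) - loss(yi, pi - eps)) / (2 * eps)
            for yi, pi in zip(y, pred)]

y = [3.0, -1.0, 10.0]
current_pred = [2.5, 0.0, 10.0]  # hypothetical predictions of the current ensemble

print(negative_gradient_squared(y, current_pred))   # [0.5, -1.0, 0.0]
print(negative_gradient_absolute(y, current_pred))  # [1.0, -1.0, 0.0]

# The analytic targets should agree with a numerical derivative of each loss.
squared_loss = lambda yi, pi: 0.5 * (yi - pi) ** 2
absolute_loss = lambda yi, pi: abs(yi - pi)
print(numeric_negative_gradient(squared_loss, y, current_pred))
print(numeric_negative_gradient(absolute_loss, y, current_pred))
```

Under absolute error the next tree's targets carry only the direction of each error and not its size, which is one reason that loss is less sensitive to outliers.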
Question 3 True / False
Reducing the learning rate in gradient boosting usually decreases final model accuracy because each tree contributes less to the ensemble.
T. True
F. False
Answer: False
A smaller learning rate typically improves generalization accuracy, provided the number of trees is increased accordingly. A small learning rate makes each additive step more conservative, acting as regularization that prevents large, overconfident updates. The tradeoff is computational: more trees are needed to reach the same training loss. Standard practice is to use a small learning rate (e.g., 0.05) and determine the number of trees via early stopping on a validation set. In practice this combination usually generalizes better than a large learning rate with fewer trees, although the size of the gain depends on the dataset.
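The following sketch shows one way the learning rate and early stopping interact, under stated assumptions: squared-error loss, one-dimensional inputs with at least two distinct values, decision stumps instead of full trees, and a synthetic noisy sine dataset. The function `boost` and its `patience` parameter are hypothetical names for this illustration, not any library's API.

```python
import math
import random

def fit_stump(xs, targets):
    """Least-squares stump on 1-D inputs (assumes at least two distinct x values)."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [v for x, v in zip(xs, targets) if x <= t]
        right = [v for x, v in zip(xs, targets) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - lv) ** 2 for v in left) + sum((v - rv) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def boost(train, valid, learning_rate, max_trees=300, patience=20):
    """Return (best tree count, training curve, validation curve)."""
    (xs, ys), (vx, vy) = train, valid
    base = sum(ys) / len(ys)
    train_pred, valid_pred = [base] * len(xs), [base] * len(vx)
    train_curve, valid_curve = [mse(ys, train_pred)], [mse(vy, valid_pred)]
    best_n = 0
    for n in range(1, max_trees + 1):
        stump = fit_stump(xs, [y - p for y, p in zip(ys, train_pred)])
        train_pred = [p + learning_rate * predict_stump(stump, x)
                      for p, x in zip(train_pred, xs)]
        valid_pred = [p + learning_rate * predict_stump(stump, x)
                      for p, x in zip(valid_pred, vx)]
        train_curve.append(mse(ys, train_pred))
        valid_curve.append(mse(vy, valid_pred))
        if valid_curve[-1] < valid_curve[best_n]:
            best_n = n                      # validation loss improved
        elif n - best_n >= patience:
            break                           # no improvement for `patience` trees
    return best_n, train_curve, valid_curve

rng = random.Random(0)
points = [(i / 10, math.sin(i / 10) + rng.gauss(0, 0.3)) for i in range(80)]
train = ([x for x, _ in points[0::2]], [y for _, y in points[0::2]])
valid = ([x for x, _ in points[1::2]], [y for _, y in points[1::2]])

fast = boost(train, valid, learning_rate=0.5)
slow = boost(train, valid, learning_rate=0.05)
for name, (best_n, _, valid_curve) in (("lr=0.5", fast), ("lr=0.05", slow)):
    print(name, "trees kept:", best_n, "validation MSE:", round(valid_curve[best_n], 4))
```

After the same number of trees the smaller learning rate has removed less training error, which is why it needs more trees; whether it ends with a lower validation error depends on the data, so the printed numbers are worth inspecting rather than assuming.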
Question 4 True / False
In gradient boosting, each tree is trained to predict the original target values, and the residuals from each tree are used primarily to select subsequent tree split points.
T. True
F. False
Answer: False
Each tree in gradient boosting is explicitly trained to predict the current residuals (or, more generally, the negative gradients). These are the target values for each successive tree, not just a criterion for split selection. Both the tree's structure and its leaf values are fitted to these targets. Only the initial prediction is computed directly from the original targets (typically a constant, such as the target mean for squared-error regression); every subsequent tree fits a supervised signal derived from the current ensemble's errors rather than the original labels themselves.
Question 5 Short Answer
Explain why gradient boosting is called 'gradient' boosting — what gradient is being computed, and in what space is gradient descent being performed?
Model answer: Gradient boosting performs gradient descent in function space rather than parameter space. The gradient is taken not with respect to model weights but with respect to the prediction function itself, evaluated pointwise at each training example. Specifically, for each data point, the negative gradient of the loss at its current predicted value gives the direction in which that prediction should move to decrease the loss. Each new tree is fitted to these negative gradients (the pseudo-residuals), and adding it, scaled by the learning rate, moves the prediction function one step in an approximate direction of steepest descent.
This framing explains the method's generality: for any differentiable loss, compute the pointwise gradient and fit a tree to its negative. For squared error loss written as ½(y − ŷ)², the gradient with respect to ŷ is −(y − ŷ), so fitting the negative gradient is identical to fitting residuals (without the factor of ½, the two differ only by a constant factor of 2). That special case makes 'fitting residuals' intuitive, but the phrase is misleading as a general description. For absolute error, the pseudo-residuals are ±1 (the sign of each error). The function-space framing unifies many boosting algorithms under one theoretical framework and clarifies why learning rate and tree count trade off directly.
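The model answer can be written compactly as one boosting step. The notation here is an assumption introduced for illustration and does not come from the quiz: $F_{m-1}$ is the ensemble after $m-1$ trees, $h_m$ is the new tree, $\nu$ is the learning rate, and $L$ is the per-example loss.

```latex
\begin{align*}
r_{i}^{(m)} &= -\left[\frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}\right]_{F = F_{m-1}}
  && \text{(pseudo-residual for example } i\text{)} \\
h_m &= \arg\min_{h} \sum_{i=1}^{n} \bigl(r_{i}^{(m)} - h(x_i)\bigr)^2
  && \text{(fit a tree to the pseudo-residuals)} \\
F_m(x) &= F_{m-1}(x) + \nu\, h_m(x)
  && \text{(one shrunken step in function space)} \\[6pt]
L = \tfrac{1}{2}\bigl(y - F\bigr)^2 &\;\Rightarrow\; r_i^{(m)} = y_i - F_{m-1}(x_i) \\
L = \lvert y - F \rvert &\;\Rightarrow\; r_i^{(m)} = \operatorname{sign}\bigl(y_i - F_{m-1}(x_i)\bigr)
\end{align*}
```

The last two lines are the special cases discussed above: plain residuals for squared error and the sign of each error for absolute error.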