OLS estimates the line of best fit by minimizing a specific objective function. What does it minimize, and why is that criterion used rather than minimizing the sum of raw residuals?
Think about your answer, then reveal below.
Model answer: OLS minimizes the sum of squared residuals, Σ(yᵢ − ŷᵢ)². The sum of raw residuals is not used because positive and negative errors cancel out — any line through the mean of the data has a zero sum of residuals, so this criterion cannot distinguish a good fit from a bad one. Squaring the residuals eliminates cancellation, penalizes large errors more heavily than small ones, and produces a unique, analytically solvable minimum.
The cancellation problem is the key insight. If you simply sum (yᵢ − ŷᵢ), a line that systematically over-predicts half the data and under-predicts the other half by equal amounts scores the same as a line with no error at all. Squaring forces all contributions to be non-negative, so the only way to drive the sum toward zero is to make every individual residual small. This also explains why OLS is sensitive to outliers — a single point far from the line contributes a very large squared residual that the estimator works hard to reduce.