When would you choose gradient descent over the normal equations to fit a linear regression model?
Think about your answer, then reveal below.
Model answer: When the number of features (p) is very large. The normal equations require forming and inverting a p×p matrix, which costs O(p³) time and O(p²) memory. A gradient descent step costs only O(n·p), so gradient descent is the standard approach when p can be thousands or millions.
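The normal equation β = (XᵀX)⁻¹Xᵀy requires forming the p×p matrix XᵀX, which costs O(n·p²), and then inverting it, which costs O(p³). For small datasets with few features this is fast and gives the exact least-squares solution in one shot, provided XᵀX is invertible. For high-dimensional problems, the cubic cost of inversion becomes prohibitive. Gradient descent instead updates the parameters iteratively using the gradient of the loss. Each full-batch step costs only O(n·p), so it scales far better with p, at the price of needing many steps and a suitable learning rate.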
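To make the comparison concrete, here is a minimal NumPy sketch of both approaches on a small synthetic problem. The function names, the mean-squared-error loss, the learning rate, the step count, and the generated data are all illustrative assumptions, not part of the original card. The closed-form version uses np.linalg.solve on XᵀXβ = Xᵀy rather than an explicit inverse, which is mathematically equivalent and numerically safer.

```python
import numpy as np

def fit_normal_equations(X, y):
    # Solve (X^T X) beta = X^T y. Forming X^T X is O(n*p^2); the solve is O(p^3).
    return np.linalg.solve(X.T @ X, X.T @ y)

def fit_gradient_descent(X, y, lr=0.1, n_steps=2000):
    # Full-batch gradient descent on the mean squared error.
    # lr and n_steps are assumed values chosen for this toy problem.
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_steps):
        residual = X @ beta - y                # O(n*p)
        grad = (2.0 / n) * (X.T @ residual)    # O(n*p)
        beta -= lr * grad
    return beta

# Synthetic data (assumed for illustration): n = 500 samples, p = 5 features.
rng = np.random.default_rng(0)
n, p = 500, 5
X = rng.normal(size=(n, p))
true_beta = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_beta + 0.01 * rng.normal(size=n)

beta_ne = fit_normal_equations(X, y)
beta_gd = fit_gradient_descent(X, y)

# Largest coefficient difference between the two fits.
print(np.max(np.abs(beta_ne - beta_gd)))
```

On a problem this small both methods should land on essentially the same coefficients, and the closed-form solve is the simpler choice. The gradient descent version pays off only when p is large enough that building and factoring the p×p matrix becomes the bottleneck.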