Questions: Least Squares Approximation and Normal Equations
5 questions to test your understanding
Question 1 Multiple Choice
You have 100 data points (x_i, y_i) and want to fit a line y = mx + c. Writing one equation per point gives a 100×2 system Ax = b, where the unknown vector x holds (m, c) and b holds the observed y values. Why can't you, in general, solve this system exactly?
A. The system is underdetermined: with only 2 unknowns and 100 equations, there are infinitely many solutions
B. The system is overdetermined: with 100 equations and 2 unknowns, no single line can pass through all 100 points exactly (unless they are perfectly collinear)
C. The system cannot be solved because A is not a square matrix
D. The system can always be solved exactly; least squares is just an optimization technique for speed
Answer: B
With 100 equations and 2 unknowns, the system is overdetermined: there are far more constraints than degrees of freedom. Unless all 100 data points happen to lie exactly on a single line (an almost impossible coincidence with real data), no single pair (m, c) satisfies all 100 equations simultaneously, so the system is inconsistent. Least squares finds the best approximate solution by minimizing the sum of squared errors. Option A confuses 'more equations than unknowns' with underdetermination (more unknowns than equations). Option C is right that A is not square, but that is not the reason: a non-square system can still have an exact solution if it is consistent. Option D is wrong because an inconsistent system has no exact solution at all.
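The setup described above can be sketched in NumPy. This is a minimal illustration, not part of the original quiz: the synthetic data (a line with slope 2 and intercept 1 plus Gaussian noise), the seed, and all variable names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed, illustrative only
x = np.linspace(0.0, 10.0, 100)           # 100 sample locations
# Assumed "true" line m=2, c=1 with noise, so the points are not collinear.
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# One row [x_i, 1] per data point: A is 100x2, the unknowns are (m, c).
A = np.column_stack([x, np.ones_like(x)])

# rank([A | y]) > rank(A) means the system A @ [m, c] = y is inconsistent.
rank_A = np.linalg.matrix_rank(A)
rank_aug = np.linalg.matrix_rank(np.column_stack([A, y]))

# Least squares estimate of (m, c).
coeffs, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
m_hat, c_hat = coeffs
residual_norm = np.linalg.norm(y - A @ coeffs)

print("shape of A:", A.shape)
print("rank(A), rank([A|y]):", rank_A, rank_aug)
print("estimated m, c:", m_hat, c_hat)
print("residual norm:", residual_norm)
```

The residual norm is strictly positive here, which is the numerical signature of an overdetermined, inconsistent system: no pair (m, c) reproduces all 100 observations.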
Question 2 Multiple Choice
What is the geometric interpretation of the least squares solution x̂ to an inconsistent system Ax = b?
A. x̂ minimizes the number of equations that are violated
B. x̂ is the vector such that Ax̂ is the orthogonal projection of b onto the column space of A
C. x̂ is the midpoint between the closest two exact solutions
D. x̂ minimizes the maximum error across all equations
Answer: B
The column space of A is the set of all vectors Ax can produce. Since b is not in this subspace (the system is inconsistent), the best approximation Ax̂ is the point in the column space closest to b, which is its orthogonal projection. The residual b − Ax̂ is then perpendicular to the column space, which is why every column of A is orthogonal to the residual. This orthogonality condition, written as A^T(b − Ax̂) = 0, immediately gives the normal equations A^TAx̂ = A^Tb. Options A and C do not describe least squares: it does not count violated equations, and an inconsistent system has no exact solutions to take a midpoint of. Option D describes the minimax criterion (used in L∞ optimization), not least squares.
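The projection picture can be checked numerically. The sketch below uses a small hypothetical 3×2 system (the matrix, right-hand side, and variable names are assumptions for illustration) and builds the projection matrix P = A(A^TA)^{-1}A^T, which is valid here because A has full column rank.

```python
import numpy as np

# Hypothetical inconsistent system: 3 equations, 2 unknowns.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
p = A @ x_hat                     # closest point to b in the column space of A
residual = b - p

# Orthogonal projector onto col(A); requires full column rank.
P = A @ np.linalg.inv(A.T @ A) @ A.T

# Any other candidate x gives a point in col(A) that is no closer to b.
rng = np.random.default_rng(0)
base_dist = np.linalg.norm(residual)
other_dists = [np.linalg.norm(b - A @ (x_hat + rng.normal(size=2)))
               for _ in range(5)]

print("x_hat:", x_hat)
print("projection p:", p)
print("A^T residual:", A.T @ residual)
print("distance at x_hat:", base_dist)
print("distances at perturbed x:", other_dists)
```

For this particular system the normal equations can be solved by hand (3c + 3d = 6, 3c + 5d = 0), giving x̂ = (5, −3), projection (5, 2, −1), and residual (1, −2, 1), which is orthogonal to both columns of A.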
Question 3 True / False
In the least squares solution to Ax = b, the residual vector b − Ax̂ is orthogonal to every column of A.
T. True
F. False
Answer: True
This is the fundamental geometric fact that generates the normal equations. The fitted vector Ax̂ is the orthogonal projection of b onto the column space of A. By definition of orthogonal projection, the vector from the projected point back to b, which is the residual b − Ax̂, must be perpendicular to everything in the column space, including each column of A. Stacking the conditions 'column j of A is orthogonal to the residual' gives A^T(b − Ax̂) = 0, which rearranges to A^TAx̂ = A^Tb: the normal equations. The entire derivation of least squares follows from this one orthogonality condition.
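As a quick numerical check of this statement, the sketch below solves the normal equations directly for a random system and takes the dot product of the residual with each column of A. The random 6×3 matrix, the seed, and the variable names are assumptions made for the example; solving A^TAx̂ = A^Tb directly is used here only because it mirrors the derivation.

```python
import numpy as np

rng = np.random.default_rng(1)       # fixed seed, illustrative only
A = rng.normal(size=(6, 3))          # random tall matrix, full column rank almost surely
b = rng.normal(size=6)

# Solve the normal equations A^T A x = A^T b.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)
residual = b - A @ x_hat

# One dot product per column of A; each should vanish up to rounding.
column_dots = np.array([A[:, j] @ residual for j in range(A.shape[1])])

# Cross-check against NumPy's least squares routine.
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)

print("column . residual:", column_dots)
print("matches lstsq:", np.allclose(x_hat, x_ref))
```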
Question 4 True / False
Computing the normal equations by forming A^TA directly is generally numerically preferable to using QR decomposition because it reduces the size of the matrix.
T. True
F. False
Answer: False
Forming A^TA directly is numerically inferior to QR decomposition. In the 2-norm, the condition number of A^TA is the square of the condition number of A, so rounding errors can be amplified far more. If A is already ill-conditioned, the solution computed from A^TA can be catastrophically inaccurate. QR decomposition factors A = QR, with Q having orthonormal columns and R upper triangular, reducing the normal equations to the triangular system Rx̂ = Q^Tb. R has the same condition number as A, so this system is solved by back-substitution without squaring the condition number. The normal equation form A^TAx̂ = A^Tb is conceptually cleaner for understanding why least squares works, but in practice QR is the more numerically stable method.
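The two routes can be compared in a short sketch. The cubic polynomial fit, the noise level, and the helper `back_substitution` are assumptions introduced for illustration; the matrix is only mildly ill-conditioned, so both routes agree here, and the point is the squared condition number rather than a visible failure.

```python
import numpy as np

def back_substitution(R, c):
    """Solve R x = c for square upper-triangular R with nonzero diagonal."""
    n = R.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (c[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x

rng = np.random.default_rng(2)                  # fixed seed, illustrative only
t = np.linspace(0.0, 1.0, 50)
A = np.vander(t, 4, increasing=True)            # columns 1, t, t^2, t^3
b = A @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.01, size=t.size)

# 2-norm condition numbers: cond(A^T A) equals cond(A) squared.
cond_A = np.linalg.cond(A)
cond_AtA = np.linalg.cond(A.T @ A)

# QR route: A = QR (reduced form), then solve R x = Q^T b.
Q, R = np.linalg.qr(A)
x_qr = back_substitution(R, Q.T @ b)

# Normal equations route, shown only for comparison.
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

print("cond(A):", cond_A)
print("cond(A^T A):", cond_AtA, "vs cond(A)^2:", cond_A ** 2)
print("x via QR:", x_qr)
print("x via normal equations:", x_ne)
```

Raising the polynomial degree makes cond(A) grow quickly, which is the regime where the normal equations route starts to lose accuracy relative to QR.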
Question 5 Short Answer
Why do the normal equations A^TAx̂ = A^Tb always have at least one solution, even when the original system Ax = b has none?
Model answer: Multiplying both sides of Ax = b by A^T gives the normal equations A^TAx̂ = A^Tb. This system is always consistent because A^Tb always lies in the column space of A^TA: A^Tb is in the column space of A^T, and A^TA has the same column space as A^T. Geometrically, the normal equations express the orthogonality condition that the residual b − Ax̂ be perpendicular to the column space of A, and there is always a point in the column space closest to any given vector b. When A has full column rank, A^TA is invertible and the solution is unique; when A has linearly dependent columns, there are infinitely many solutions x̂ (all giving the same Ax̂), but at least one always exists.
The key insight is that going from Ax = b (inconsistent) to A^TAx̂ = A^Tb (always consistent) is not a coincidence; it is precisely the point of the construction. Multiplying by A^T discards the component of b orthogonal to the column space of A, which is exactly the part that no choice of x can match. The geometric language makes this clearest: 'find the projection of b onto the column space of A' always has an answer (the projection always exists), even when 'find x such that Ax = b exactly' does not.
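The rank-deficient case can be illustrated with a small sketch. The 4×3 matrix below (third column equal to the sum of the first two), the right-hand side, and the variable names are hypothetical choices for the example. It compares ranks to show that Ax = b is inconsistent while the normal equations are consistent, then exhibits two different solutions.

```python
import numpy as np

# Hypothetical rank-deficient A: column 3 = column 1 + column 2.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [1.0, 0.0, 1.0]])
b = np.array([1.0, 2.0, 0.0, 3.0])

AtA = A.T @ A
Atb = A.T @ b

# A system M z = c is consistent exactly when rank([M | c]) == rank(M).
rank_A = np.linalg.matrix_rank(A)
rank_A_aug = np.linalg.matrix_rank(np.column_stack([A, b]))
rank_AtA = np.linalg.matrix_rank(AtA)
rank_AtA_aug = np.linalg.matrix_rank(np.column_stack([AtA, Atb]))

# lstsq returns the minimum-norm least squares solution.
x_min, *_ = np.linalg.lstsq(A, b, rcond=None)

# Adding a null-space vector of A gives another solution of the normal equations.
null_vec = np.array([1.0, 1.0, -1.0])           # A @ null_vec == 0
x_other = x_min + 2.0 * null_vec

print("Ax = b consistent:", rank_A == rank_A_aug)
print("normal equations consistent:", rank_AtA == rank_AtA_aug)
print("x_min solves them:", np.allclose(AtA @ x_min, Atb))
print("x_other solves them:", np.allclose(AtA @ x_other, Atb))
print("same fitted vector:", np.allclose(A @ x_min, A @ x_other))
```

Both solutions produce the same fitted vector Ax̂, which reflects the fact that the projection of b is unique even when x̂ is not.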