R-Squared: Goodness of Fit


Core Idea

R² = 1 - (RSS / TSS) measures the fraction of the sample variation in Y explained by the regressors. For OLS with an intercept it lies between 0 and 1. Higher values indicate a closer in-sample fit, but R² cannot tell you whether the model is causal or whether omitted variables bias the estimates.

Explainer

From your study of simple linear regression, you know that OLS finds the line that minimizes the sum of squared residuals, where each residual is the vertical distance between a data point and the fitted line. R² is built from two sums of squares. Total Sum of Squares (TSS) is the total variation in Y around its mean: how spread out the outcome variable is before you add any predictors. Residual Sum of Squares (RSS) is the sum of squared residuals: the variation left over after fitting your model, which your regressors failed to explain. R² = 1 - (RSS/TSS) is then simply the fraction of total variation that the model accounts for.
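Written out in symbols, the definitions above look as follows. The notation is an assumption added here for illustration and does not appear in the original text: n is the sample size, Y_i an observed outcome, \bar{Y} the sample mean of Y, and \hat{Y}_i the fitted value from the regression.

```latex
\begin{align*}
\mathrm{TSS} &= \sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)^2, \\
\mathrm{RSS} &= \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2, \\
R^2 &= 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}.
\end{align*}
```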

The formula is easiest to read at its extremes. If your model explained nothing, RSS would equal TSS and R² = 0. If your model explained everything perfectly, RSS = 0 and R² = 1. In practice R² lives between these extremes, and interpreting it is context-dependent. A model explaining household income from age and education might achieve R² = 0.35 and be considered quite good, because income is driven by dozens of unobserved factors. A model predicting tomorrow's temperature from yesterday's temperature might achieve R² = 0.97. The benchmark is never "how close to 1?" but rather "how much variation was plausibly explainable by these specific predictors?"
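The two extremes can be checked numerically. The sketch below is illustrative only: the five data points and the helper name r_squared are made up for this example, and NumPy's polyfit stands in for whatever OLS routine you normally use.

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - RSS/TSS for observed y and fitted values y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    rss = np.sum((y - y_hat) ** 2)      # variation left unexplained
    tss = np.sum((y - y.mean()) ** 2)   # total variation around the mean
    return 1.0 - rss / tss

# Hypothetical toy data, chosen only so the arithmetic is easy to follow.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# A "model" that always predicts the mean explains nothing: RSS == TSS.
r2_mean_only = r_squared(y, np.full_like(y, y.mean()))

# A "model" that reproduces y exactly leaves no residual: RSS == 0.
r2_perfect = r_squared(y, y)

# An actual OLS line with intercept falls between the two extremes.
slope, intercept = np.polyfit(x, y, deg=1)
r2_ols = r_squared(y, intercept + slope * x)

print(r2_mean_only, r2_ols, r2_perfect)
```

For this toy sample TSS is 6 and the fitted line leaves RSS of 2.4, so the OLS value works out to 0.6, between the mean-only value of 0 and the perfect-fit value of 1.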

The most important limitation of R² is that it never falls when you add a variable, even a completely irrelevant one, and in a finite sample it almost always rises. Because OLS minimizes RSS on the sample data, a larger model can always reproduce the smaller model's fit by setting the new coefficients to zero, so adding noise variables never hurts in-sample fit. A model with 50 predictors will therefore have an R² at least as high as a model that uses only 5 of those predictors on the same data, even if the other 45 are uncorrelated with Y in the population. This motivates the adjusted R², which penalizes the number of parameters, and cross-validation methods that assess out-of-sample fit.
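A small simulation makes the mechanical increase concrete. This is a sketch under stated assumptions: the data-generating process, sample size, coefficients, and the helper fit_r2 are all invented for illustration, and the adjusted R² uses the standard formula 1 - (1 - R²)(n - 1)/(n - k - 1), with k regressors excluding the intercept.

```python
import numpy as np

def fit_r2(X, y):
    """Fit OLS with an intercept; return (R^2, adjusted R^2)."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])          # prepend intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)  # least-squares coefficients
    rss = np.sum((y - X1 @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    k = X1.shape[1] - 1                            # regressors, excluding intercept
    r2 = 1.0 - rss / tss
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return r2, adj_r2

rng = np.random.default_rng(0)
n = 200

# Five regressors that truly affect y (hypothetical coefficients).
x_real = rng.normal(size=(n, 5))
y = x_real @ np.array([1.0, -0.5, 0.3, 0.8, -1.2]) + rng.normal(scale=2.0, size=n)

# Forty-five regressors generated independently of y: pure noise.
x_noise = rng.normal(size=(n, 45))

r2_small, adj_small = fit_r2(x_real, y)
r2_big, adj_big = fit_r2(np.column_stack([x_real, x_noise]), y)

print(f"5 predictors:  R^2 = {r2_small:.3f}, adjusted R^2 = {adj_small:.3f}")
print(f"50 predictors: R^2 = {r2_big:.3f}, adjusted R^2 = {adj_big:.3f}")
```

Because the 5-predictor model is nested in the 50-predictor model, the larger model's R² cannot be lower. The adjusted R² carries no such guarantee, since its penalty grows with k.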

The deeper limitation is that R² says nothing about causality. A model with R² = 0.95 might be severely confounded, with biased coefficient estimates, if key variables are omitted or regressors are endogenous. Conversely, a randomized experiment might produce a regression with R² = 0.02 and still deliver an unbiased, causally interpretable estimate of the treatment effect. R² is a measure of descriptive fit, not of the quality of causal identification. This is why econometricians often care more about whether their estimates are consistent and unbiased than about whether R² is high.
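The contrast can be simulated. The sketch below is purely illustrative: both data-generating processes, the true effect sizes, and the helper simple_ols are assumptions invented for this example, not results from any real study.

```python
import numpy as np

def simple_ols(d, y):
    """Regress y on d with an intercept; return (slope, R^2)."""
    slope, intercept = np.polyfit(d, y, deg=1)
    resid = y - (intercept + slope * d)
    r2 = 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
    return slope, r2

rng = np.random.default_rng(42)
n = 10_000

# Case 1: randomized experiment. Treatment is assigned by coin flip,
# the true effect is 1.0, and the outcome is very noisy.
true_effect = 1.0
d_rand = rng.integers(0, 2, size=n).astype(float)
y_rand = true_effect * d_rand + rng.normal(scale=5.0, size=n)
slope_rand, r2_rand = simple_ols(d_rand, y_rand)

# Case 2: confounded observational data. An unobserved factor u drives
# both d and y, while the true causal effect of d on y is zero.
u = rng.normal(size=n)
d_conf = u + 0.1 * rng.normal(size=n)
y_conf = 3.0 * u + 0.1 * rng.normal(size=n)
slope_conf, r2_conf = simple_ols(d_conf, y_conf)

print(f"randomized: slope = {slope_rand:.2f} (truth 1.0), R^2 = {r2_rand:.3f}")
print(f"confounded: slope = {slope_conf:.2f} (truth 0.0), R^2 = {r2_conf:.3f}")
```

In the randomized case the slope should land near the true effect even though R² is tiny. In the confounded case R² is close to 1 while the slope is far from the true effect of zero.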


