A researcher runs OLS regression of annual income on years of education and obtains R² = 0.82. She concludes that education strongly causes higher income. What is the fundamental error in this reasoning?
A. R² above 0.8 is implausibly high, suggesting a coding error
B. OLS minimizes absolute errors, not squared errors, so R² measures the wrong criterion
C. R² measures goodness of fit (how well education predicts income in the sample), but causality requires E(u|X) = 0, which cannot be established from the regression output alone
D. The intercept must be statistically significant for causal inference to be valid
Answer: C
R² tells you what share of Y's variance is explained by X in your sample; it is a measure of prediction quality. Causality requires exogeneity, E(u|X) = 0: the unobserved determinants of income collected in u must be unrelated to education. This assumption is about the data-generating process, not the fit of the regression. People with more education may also differ in ability, family background, and networks (all in u), so the slope may capture those effects as well as the causal impact of education. High R² is perfectly compatible with severe omitted variable bias.
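The point that a high R² can coexist with omitted variable bias can be illustrated with a small simulation. This is a minimal sketch under an assumed data-generating process: the variable `ability`, the coefficients, and the sample size are all hypothetical choices made for illustration, not values from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical data-generating process (all numbers are illustrative assumptions).
ability = rng.normal(0.0, 1.0, n)                       # unobserved, ends up in u
education = 12.0 + 2.0 * ability + rng.normal(0.0, 1.0, n)
true_effect = 1.0                                       # assumed causal effect of one year
income = 10.0 + true_effect * education + 4.0 * ability + rng.normal(0.0, 1.0, n)

# OLS of income on education only; ability is omitted.
slope = np.cov(education, income, ddof=1)[0, 1] / np.var(education, ddof=1)
intercept = income.mean() - slope * education.mean()
residuals = income - (intercept + slope * education)
r_squared = 1.0 - residuals.var() / income.var()

print(f"true effect = {true_effect:.2f}, OLS slope = {slope:.2f}, R^2 = {r_squared:.2f}")
```

Under these assumed parameters the population slope is 13/5 = 2.6 and the population R² is about 0.89, so the fit looks excellent while the slope overstates the assumed causal effect of 1.0 by a wide margin.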
Question 2 Multiple Choice
What is the correct interpretation of the OLS slope estimator β̂₁ = Cov(X,Y) / Var(X)?
A. The fraction of the variation in Y that is explained by X
B. The probability that a one-unit increase in X causes Y to increase
C. The average change in Y associated with a one-unit change in X, measuring how much Y co-moves with X scaled by X's own variability
D. The average value of X when Y equals zero
Answer: C
β̂₁ = Cov(X,Y)/Var(X) takes the joint variation between X and Y (the covariance) and scales it by how much X varies on its own (the variance of X) to give a per-unit-of-X number. Concretely: if X is years of schooling and Y is wages in dollars, β̂₁ is the average difference in wages associated with one additional year of schooling in the sample. Option A describes R², not the slope. Option B is a causal statement that requires additional assumptions, and a slope is not a probability in any case. Option D reverses the definition of the intercept: β̂₀ is the predicted value of Y when X equals zero, and neither quantity is the slope.
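The covariance-over-variance formula can be checked directly against a standard least-squares fit. The sketch below uses a made-up six-observation sample (the `schooling` and `wage` values are hypothetical) and compares the hand-computed slope and intercept with the output of `np.polyfit`.

```python
import numpy as np

# Hypothetical toy sample: years of schooling and hourly wage in dollars.
schooling = np.array([10.0, 12.0, 12.0, 14.0, 16.0, 18.0])
wage = np.array([14.0, 17.0, 19.0, 22.0, 27.0, 30.0])

# Slope as sample covariance divided by sample variance of X.
cov_xy = np.cov(schooling, wage, ddof=1)[0, 1]
var_x = np.var(schooling, ddof=1)
beta1_hat = cov_xy / var_x
beta0_hat = wage.mean() - beta1_hat * schooling.mean()

# Cross-check against NumPy's least-squares line fit (returns slope, then intercept).
slope_check, intercept_check = np.polyfit(schooling, wage, deg=1)

print(f"formula: slope={beta1_hat:.4f}, intercept={beta0_hat:.4f}")
print(f"polyfit: slope={slope_check:.4f}, intercept={intercept_check:.4f}")
```

The two approaches agree up to floating-point error, since the ratio Cov(X,Y)/Var(X) is the closed-form solution of the least-squares problem with one regressor.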
Question 3 True / False
OLS estimation of β̂₁ and β̂₀ requires that the residuals are normally distributed.
T. True
F. False
Answer: False
Normality of the error term u is an assumption used for inference: it gives the t-statistics and F-statistics behind hypothesis tests and confidence intervals their exact distributions in small samples (in large samples, asymptotic approximations do the job without it). The residuals are only the sample estimates of u and are not the same object. The OLS estimators β̂₁ = Cov(X,Y)/Var(X) and β̂₀ = Ȳ − β̂₁X̄ are just algebraic formulas: they can be computed from any sample in which X varies, and they are unbiased under the Gauss-Markov assumptions without any normality requirement. Students often conflate the conditions needed for estimation with those needed for inference.
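A short Monte Carlo exercise can make the estimation-versus-inference distinction concrete. The sketch below assumes a hypothetical model with true slope 0.5 and errors drawn from a centered exponential distribution (mean zero, strongly skewed, clearly not normal), then averages the OLS slope over many simulated samples. All parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
true_beta0, true_beta1 = 2.0, 0.5   # assumed population coefficients
n, replications = 50, 5_000

x = rng.uniform(0.0, 10.0, n)       # regressor held fixed across replications
slopes = np.empty(replications)
for r in range(replications):
    # Exponential(1) minus 1: mean zero, right-skewed, independent of x.
    u = rng.exponential(1.0, n) - 1.0
    y = true_beta0 + true_beta1 * x + u
    slopes[r] = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

mean_slope = slopes.mean()
print(f"true slope = {true_beta1}, average OLS slope = {mean_slope:.4f}")
```

The average of the estimates sits close to the true slope even though the errors are far from normal, which is what unbiasedness without a normality assumption looks like in practice.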
Question 4 True / False
A high R² value in a regression of Y on X means that X explains a large share of the variation in Y, but does not by itself establish that X causes Y.
T. True
F. False
Answer: True
R² is purely a goodness-of-fit measure: R² = 1 − SSR/SST, where SSR is the sum of squared residuals (unexplained variation) and SST is the total sum of squares (total variation in Y). A regression of height on shoe size has high R² because the two are strongly correlated, but shoe size does not cause height; both are driven by genetics and nutrition. Causality requires the exogeneity condition E(u|X) = 0, which implies that X is uncorrelated with the other determinants of Y collected in u. No amount of predictive fit can substitute for this structural condition.
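One way to see that R² carries no causal information is that, with a single regressor, it is the same whichever variable is placed on the left-hand side. The sketch below computes R² = 1 − SSR/SST for simulated height and shoe-size data; the common cause `growth` and every coefficient are hypothetical stand-ins for genetics and nutrition.

```python
import numpy as np

def r_squared(x, y):
    """R^2 from a simple OLS regression of y on x, computed as 1 - SSR/SST."""
    slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    intercept = y.mean() - slope * x.mean()
    ssr = np.sum((y - intercept - slope * x) ** 2)   # sum of squared residuals
    sst = np.sum((y - y.mean()) ** 2)                # total sum of squares
    return 1.0 - ssr / sst

rng = np.random.default_rng(2)
n = 2_000
# Hypothetical common cause driving both variables; neither causes the other.
growth = rng.normal(0.0, 1.0, n)
height_cm = 170.0 + 8.0 * growth + rng.normal(0.0, 3.0, n)
shoe_size = 42.0 + 1.5 * growth + rng.normal(0.0, 0.5, n)

r2_height_on_shoe = r_squared(shoe_size, height_cm)
r2_shoe_on_height = r_squared(height_cm, shoe_size)
corr_squared = np.corrcoef(shoe_size, height_cm)[0, 1] ** 2

print(r2_height_on_shoe, r2_shoe_on_height, corr_squared)
```

All three numbers coincide, because in a one-regressor model R² equals the squared sample correlation. A statistic that is symmetric in X and Y cannot indicate which way causation runs, or whether it runs at all.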
Question 5 Short Answer
Why can a regression with high R² still fail to identify a causal effect of X on Y? What additional condition is required, and why is that condition not visible in the regression output?
Model answer: High R² means X accounts for much of the variation in Y in the sample, but the variation being explained may come from confounders: variables correlated with both X and Y that are omitted from the regression and absorbed into the error term u. For the slope on X to have a causal interpretation, we need E(u|X) = 0 (exogeneity): no systematic relationship between X and the unobserved determinants of Y. This condition is not visible in the regression output because it is a claim about the data-generating process and its unmeasured variables, not about the data we observe. The residuals cannot reveal a violation, because OLS makes them uncorrelated with X in the sample by construction, whatever the true relationship between u and X. R² can be very high even when u and X are strongly correlated due to omitted variables.
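The reason exogeneity cannot be read off the regression output can be sketched in a simulation where the true error is known. In the hypothetical setup below, an unobserved `confounder` enters both x and the true error u, so exogeneity fails by design; the code then compares the correlation of x with the true error against its correlation with the OLS residuals. The names and coefficients are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000

confounder = rng.normal(0.0, 1.0, n)             # unobserved by the analyst
x = confounder + rng.normal(0.0, 1.0, n)
u = 2.0 * confounder + rng.normal(0.0, 1.0, n)   # true error, contains the confounder
y = 1.0 + 0.5 * x + u

slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()
residuals = y - intercept - slope * x

corr_true_error = np.corrcoef(x, u)[0, 1]          # only computable in a simulation
corr_residuals = np.corrcoef(x, residuals)[0, 1]   # what the analyst can compute

print(f"corr(x, true error) = {corr_true_error:.3f}")
print(f"corr(x, residuals)  = {corr_residuals:.3e}")
```

The residuals are uncorrelated with x up to floating-point error even though the true error is strongly correlated with it, so a residual check of this kind can never detect the violation.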
The distinction between prediction and causation is the single most important conceptual gap in applied regression. R² measures fit; the exogeneity condition is what allows a slope coefficient to be interpreted as a causal effect. Every sophisticated regression strategy — instrumental variables, regression discontinuity, difference-in-differences — is essentially a way to create or exploit situations where exogeneity (or something close to it) plausibly holds.