Applies linear regression modeling to social science research questions, covering ordinary least squares estimation, interpretation of regression coefficients, model diagnostics, and addressing violations of assumptions. Emphasizes theoretical justification and causal thinking in observational research.
Estimate regressions on social science datasets, create visualizations of relationships, test for assumption violations, and practice interpreting coefficients for different outcome scales.
When you apply linear regression to social science data, the mechanics are the same as in statistics or mathematics — fit a line through data by minimizing the sum of squared residuals (OLS). But social science adds a layer that statistics courses often skip: *what do the coefficients mean, and when can you call them causal?*
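To make the "minimize the sum of squared residuals" idea concrete, here is a minimal sketch using NumPy on simulated data. The variable names (`education`, `income`) and all numeric values are invented for illustration and do not come from any real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: years of schooling and an income-like outcome.
education = rng.normal(13, 2.5, n)
income = 10 + 2.0 * education + rng.normal(0, 5, n)

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), education])

# OLS: the coefficients that minimize the sum of squared residuals.
beta, *_ = np.linalg.lstsq(X, income, rcond=None)
ssr_ols = float(np.sum((income - X @ beta) ** 2))

# Any other line through the same data has a larger sum of squared residuals.
other_line = beta + np.array([0.5, -0.05])
ssr_other = float(np.sum((income - X @ other_line) ** 2))

print(f"intercept={beta[0]:.2f}, slope={beta[1]:.2f}")
print(f"SSR at OLS fit: {ssr_ols:.1f}, SSR at nearby line: {ssr_other:.1f}")
```

The fitting step is pure mechanics. Nothing in it tells you whether the slope deserves a causal reading.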
The OLS estimate of a coefficient represents the average change in the outcome associated with a one-unit increase in the predictor, *holding all other included variables constant*. That phrase "holding constant" is doing a lot of work. It does not mean the other variables are actually fixed in reality. It means you are comparing observations that have the same values on the other included variables and differ only in the predictor of interest. Variables left out of the model are not held constant at all. If you have omitted a variable that is correlated with both the predictor and the outcome (a confounder), the coefficient on that predictor is biased.
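A small simulation can show what omitting a confounder does to a coefficient. This is a sketch under stated assumptions: the data-generating process, the effect sizes, and the helper `ols` are all made up for illustration, with the true effect of `x` on `y` set to 1.0.

```python
import numpy as np

def ols(y, *columns):
    """Return OLS coefficients (intercept first) for y on the given columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(1)
n = 20_000

# Hypothetical confounder that drives both the predictor and the outcome.
confounder = rng.normal(size=n)
x = 0.8 * confounder + rng.normal(size=n)
y = 1.0 * x + 2.0 * confounder + rng.normal(size=n)  # true effect of x is 1.0

# Short regression omits the confounder; long regression includes it.
slope_omitted = ols(y, x)[1]
slope_controlled = ols(y, x, confounder)[1]

# In this setup the omitted-variable formula gives roughly
# 1.0 + 2.0 * cov(x, confounder) / var(x) = 1.0 + 1.6 / 1.64, close to 1.98.
print(f"confounder omitted:  {slope_omitted:.2f}")
print(f"confounder included: {slope_controlled:.2f}")
```

The short regression attributes part of the confounder's effect to `x`. Only the regression that includes the confounder recovers a slope near the true value.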
This is why social scientists obsess over identification: the process of isolating causal effects from mere correlation. A significant p-value tells you that an estimate this far from zero would be unlikely if the true coefficient were zero, but it says nothing about whether the relationship is causal. Two of the most common errors in reading regression output are (1) treating significant associations as causal effects, and (2) assuming that adding more controls always improves inference. The second is particularly dangerous: some variables, called colliders, are *caused by* both the treatment and the outcome. Controlling for a collider opens a spurious association that was not present before, so adding it to the regression makes things worse.
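The collider problem is easy to see in simulation. In the sketch below, `x` and `y` are generated independently, so the true effect of `x` on `y` is zero. The variable `collider` is caused by both. All names and numbers are hypothetical.

```python
import numpy as np

def ols(y, *columns):
    """Return OLS coefficients (intercept first) for y on the given columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(2)
n = 20_000

x = rng.normal(size=n)                  # "treatment"
y = rng.normal(size=n)                  # outcome, independent of x by construction
collider = x + y + rng.normal(size=n)   # caused by both x and y

slope_without = ols(y, x)[1]            # close to 0, the true effect
slope_with = ols(y, x, collider)[1]     # about -0.5 in this setup: spurious

print(f"without collider: {slope_without:.3f}")
print(f"with collider:    {slope_with:.3f}")
```

Adding the "control" creates a negative association out of nothing. More controls are not automatically safer.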
The assumptions underlying OLS (linearity, homoskedasticity, no perfect multicollinearity, independence, and exogeneity) mostly have a diagnostic test and a remedy. Heteroskedasticity (non-constant error variance) leaves the coefficients unbiased but makes conventional standard errors too large or too small; robust standard errors address this. Multicollinearity (highly correlated predictors) does not bias coefficients but inflates their standard errors, making estimates unstable. Endogeneity, when a predictor is correlated with the error term, often due to omitted variables, produces biased coefficients and is the hardest violation to fix without an instrumental variable or natural experiment.
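As one illustration of a remedy, the sketch below computes conventional and heteroskedasticity-robust standard errors by hand, using the HC1 sandwich estimator, on simulated data whose error variance grows with the predictor. The data-generating process is an assumption chosen for illustration. In practice you would normally rely on a statistics package's robust-covariance option, not hand-rolled matrix algebra.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000

x = rng.uniform(0, 10, n)
# Hypothetical heteroskedastic errors: the error spread grows with x.
y = 1.0 + 0.5 * x + rng.normal(0, 0.2 + 0.5 * x, n)

X = np.column_stack([np.ones(n), x])
k = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Conventional standard errors assume one common error variance.
sigma2 = resid @ resid / (n - k)
se_conventional = np.sqrt(np.diag(sigma2 * XtX_inv))

# HC1 robust standard errors: sandwich estimator with a small-sample correction.
meat = X.T @ (X * resid[:, None] ** 2)
cov_hc1 = (n / (n - k)) * XtX_inv @ meat @ XtX_inv
se_robust = np.sqrt(np.diag(cov_hc1))

print(f"slope estimate:        {beta[1]:.3f}")
print(f"conventional slope SE: {se_conventional[1]:.4f}")
print(f"robust (HC1) slope SE: {se_robust[1]:.4f}")
```

The slope estimate is the same either way. Only the uncertainty attached to it changes, and in this setup the conventional standard error for the slope is too small.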
R² measures how much variance in the outcome the model explains, and high R² feels satisfying. But R² can be increased mechanically by adding variables, even irrelevant ones (adjusted R² penalizes for this). In causal social science, a model with R² = 0.15 but a clean identification strategy is far more credible than R² = 0.85 with ambiguous causal structure. Focus on whether the coefficient of interest has a defensible causal interpretation, not on whether the model explains lots of variance.
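The mechanical behavior of R² can be checked directly. The following sketch adds 20 pure-noise predictors to a simple simulated model and compares R² with adjusted R². The helper `r_squared` and all data are hypothetical.

```python
import numpy as np

def r_squared(y, predictors):
    """Return (R², adjusted R²) for an OLS fit of y on predictors plus an intercept."""
    X = np.column_stack([np.ones(len(y)), predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    ss_res = resid @ resid
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    n, k = len(y), X.shape[1] - 1          # k = number of predictors
    adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return r2, adj

rng = np.random.default_rng(4)
n = 200

x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)
noise = rng.normal(size=(n, 20))           # predictors unrelated to y

r2_base, adj_base = r_squared(y, x)
r2_big, adj_big = r_squared(y, np.column_stack([x, noise]))

print(f"x only:        R2={r2_base:.3f}, adjusted={adj_base:.3f}")
print(f"x + 20 noise:  R2={r2_big:.3f}, adjusted={adj_big:.3f}")
```

R² cannot go down when predictors are added, even useless ones. Adjusted R² applies a penalty for each extra predictor, but neither number says anything about causal structure.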