Linear Regression for Social Science

Core Idea

Applies linear regression modeling to social science research questions, covering ordinary least squares estimation, interpretation of regression coefficients, model diagnostics, and addressing violations of assumptions. Emphasizes theoretical justification and causal thinking in observational research.

How It's Best Learned

Estimate regressions on social science datasets, create visualizations of relationships, test for assumption violations, and practice interpreting coefficients on different outcome scales.

Common Misconceptions

Explainer

When you apply linear regression to social science data, the mechanics are the same as in statistics or mathematics — fit a line through data by minimizing the sum of squared residuals (OLS). But social science adds a layer that statistics courses often skip: *what do the coefficients mean, and when can you call them causal?*
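As a minimal sketch of the mechanics, the snippet below fits a line to a made-up five-point dataset using the closed-form OLS solution for a single predictor. The data and the variable names are hypothetical and chosen only for illustration. It then checks that nudging the slope away from the OLS value increases the sum of squared residuals.

```python
# Hypothetical toy data: x could be years of schooling, y an outcome score.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form OLS solution for one predictor plus an intercept.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar


def ssr(a, b):
    """Sum of squared residuals for the line y = a + b * x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
print(f"SSR at the OLS fit:       {ssr(intercept, slope):.4f}")
print(f"SSR with slope nudged up: {ssr(intercept, slope + 0.1):.4f}")
```

Any other intercept or slope gives a larger sum of squared residuals on these data, which is all that "least squares" means mechanically.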

The OLS estimate of a coefficient represents the average change in the outcome associated with a one-unit increase in the predictor, *holding all other included variables constant*. That phrase "holding constant" is doing a lot of work. It does not mean the other variables are actually fixed in reality; it means the model compares observations that differ in the predictor of interest but have the same values on the other included variables. If you have omitted a variable that affects the outcome and is correlated with the predictor (a confounder), the coefficient on that predictor absorbs part of the confounder's effect and is biased.
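Omitted variable bias can be illustrated with a small simulation. This is a sketch under stated assumptions: the data-generating process, the true effect of 1.0, and the helper `ols` (a plain normal-equations solver written for this example, not a library function) are all hypothetical.

```python
import random


def ols(predictors, y):
    """Return [intercept, b1, b2, ...] by solving the normal equations."""
    n = len(y)
    cols = [[1.0] * n] + [list(p) for p in predictors]
    k = len(cols)
    # Augmented matrix [X'X | X'y].
    a = [
        [sum(u * v for u, v in zip(cols[i], cols[j])) for j in range(k)]
        + [sum(u * v for u, v in zip(cols[i], y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(a[r][i]))
        a[i], a[p] = a[p], a[i]
        for r in range(k):
            if r != i:
                f = a[r][i] / a[i][i]
                a[r] = [vr - f * vi for vr, vi in zip(a[r], a[i])]
    return [a[i][k] / a[i][i] for i in range(k)]


random.seed(0)
n = 5000
# Assumed data-generating process: z confounds the x -> y relationship.
z = [random.gauss(0, 1) for _ in range(n)]
x = [0.8 * zi + random.gauss(0, 1) for zi in z]
y = [1.0 * xi + 2.0 * zi + random.gauss(0, 1) for xi, zi in zip(x, z)]

short_fit = ols([x], y)     # confounder omitted
long_fit = ols([x, z], y)   # confounder included

print(f"coefficient on x, z omitted:  {short_fit[1]:.3f}")
print(f"coefficient on x, z included: {long_fit[1]:.3f}")
```

In this setup the regression that omits `z` should report a coefficient on `x` well above the true value of 1.0, while the regression that includes `z` should land close to it.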

This is why social scientists obsess over identification: the process of isolating causal effects from mere correlation. A statistically significant coefficient tells you that an estimate this far from zero would be unlikely if the true coefficient were zero, but it says nothing about whether the relationship is causal. Two of the most common errors in reading regression output are (1) treating significant associations as causal effects, and (2) assuming that adding more controls always improves inference. The second is particularly dangerous: some variables, called colliders, are *caused by* both the treatment and the outcome. Controlling for a collider opens a spurious association that was not present before, so adding it to the regression makes things worse.
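The collider problem can also be seen in simulated data. In the sketch below, `treatment` and `outcome` are generated independently, so the true effect is zero, and `collider` is caused by both. The variable names, the data-generating process, and the helper `ols` are assumptions made for this illustration.

```python
import random


def ols(predictors, y):
    """Return [intercept, b1, b2, ...] by solving the normal equations."""
    n = len(y)
    cols = [[1.0] * n] + [list(p) for p in predictors]
    k = len(cols)
    # Augmented matrix [X'X | X'y].
    a = [
        [sum(u * v for u, v in zip(cols[i], cols[j])) for j in range(k)]
        + [sum(u * v for u, v in zip(cols[i], y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(a[r][i]))
        a[i], a[p] = a[p], a[i]
        for r in range(k):
            if r != i:
                f = a[r][i] / a[i][i]
                a[r] = [vr - f * vi for vr, vi in zip(a[r], a[i])]
    return [a[i][k] / a[i][i] for i in range(k)]


random.seed(42)
n = 5000
# Treatment and outcome are independent: the true effect is zero.
treatment = [random.gauss(0, 1) for _ in range(n)]
outcome = [random.gauss(0, 1) for _ in range(n)]
# The collider is caused by both treatment and outcome.
collider = [t + o + random.gauss(0, 1) for t, o in zip(treatment, outcome)]

without_collider = ols([treatment], outcome)
with_collider = ols([treatment, collider], outcome)

print(f"treatment coefficient, no collider:   {without_collider[1]:.3f}")
print(f"treatment coefficient, with collider: {with_collider[1]:.3f}")
```

Without the collider the treatment coefficient should sit near zero, as it should. Adding the collider as a "control" should produce a clearly negative coefficient, an association created entirely by the conditioning.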

The assumptions underlying OLS (linearity, homoskedasticity, no perfect multicollinearity, independence, and exogeneity) each have associated diagnostics and remedies. Heteroskedasticity (non-constant error variance) leaves the coefficient estimates unbiased but makes conventional standard errors too large or too small; robust standard errors address this. Multicollinearity (highly correlated predictors) does not bias coefficients but inflates their standard errors, making estimates unstable. Endogeneity, when a predictor is correlated with the error term, often due to omitted variables, produces biased coefficients and is the hardest violation to fix without an instrumental variable or natural experiment.
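To make the heteroskedasticity point concrete, the sketch below compares the conventional standard error of a slope with a heteroskedasticity-robust one (the HC0, or White, estimator) in a single-predictor regression. The simulated data, in which the error spread grows with the size of `x`, are an assumption made for illustration; in practice a statistics package would usually compute these quantities.

```python
import math
import random

random.seed(1)
n = 5000
x = [random.gauss(0, 1) for _ in range(n)]
# Assumed process: error spread grows with |x|, so variance is not constant.
y = [1.0 + 2.0 * xi + random.gauss(0, 1) * abs(xi) for xi in x]

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
intercept = y_bar - slope * x_bar
resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Conventional SE: assumes one common error variance for all observations.
sigma2 = sum(e * e for e in resid) / (n - 2)
classical_se = math.sqrt(sigma2 / s_xx)

# HC0 robust SE: weights each squared residual by its own (x - x_bar)^2.
meat = sum(((xi - x_bar) ** 2) * e * e for xi, e in zip(x, resid))
robust_se = math.sqrt(meat / s_xx ** 2)

print(f"slope estimate: {slope:.3f}")
print(f"classical SE:   {classical_se:.4f}")
print(f"robust SE:      {robust_se:.4f}")
```

The slope estimate should stay close to the true value of 2.0, but in this setup the conventional standard error should be noticeably smaller than the robust one, which would make the estimate look more precise than it is.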

R² measures the share of variance in the outcome that the model explains, and a high R² feels satisfying. But R² never decreases when you add a variable, even an irrelevant one (adjusted R² penalizes for this). In causal social science, a model with R² = 0.15 but a clean identification strategy is far more credible than one with R² = 0.85 and an ambiguous causal structure. Focus on whether the coefficient of interest has a defensible causal interpretation, not on whether the model explains lots of variance.
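The mechanical behavior of R² can be checked directly. The sketch below fits one model with a single relevant predictor and another that adds ten pure-noise predictors, then compares R² and adjusted R². The simulated data and the helpers `ols`, `r_squared`, and `adjusted_r_squared` are hypothetical and written only for this example.

```python
import random


def ols(predictors, y):
    """Return [intercept, b1, b2, ...] by solving the normal equations."""
    n = len(y)
    cols = [[1.0] * n] + [list(p) for p in predictors]
    k = len(cols)
    # Augmented matrix [X'X | X'y].
    a = [
        [sum(u * v for u, v in zip(cols[i], cols[j])) for j in range(k)]
        + [sum(u * v for u, v in zip(cols[i], y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(a[r][i]))
        a[i], a[p] = a[p], a[i]
        for r in range(k):
            if r != i:
                f = a[r][i] / a[i][i]
                a[r] = [vr - f * vi for vr, vi in zip(a[r], a[i])]
    return [a[i][k] / a[i][i] for i in range(k)]


def r_squared(predictors, y):
    """Share of the variance in y explained by an OLS fit on predictors."""
    b = ols(predictors, y)
    n = len(y)
    fitted = [
        b[0] + sum(bj * p[i] for bj, p in zip(b[1:], predictors))
        for i in range(n)
    ]
    y_bar = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared for n observations and k predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


random.seed(7)
n = 60
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]
# Ten predictors that are pure noise, unrelated to y by construction.
junk = [[random.gauss(0, 1) for _ in range(n)] for _ in range(10)]

r2_small = r_squared([x], y)
r2_big = r_squared([x] + junk, y)
adj_small = adjusted_r_squared(r2_small, n, 1)
adj_big = adjusted_r_squared(r2_big, n, 11)

print(f"R2, one predictor:         {r2_small:.3f} (adjusted {adj_small:.3f})")
print(f"R2, plus ten noise terms:  {r2_big:.3f} (adjusted {adj_big:.3f})")
```

The larger model cannot have a lower R² than the smaller one even though the added predictors are noise, while the adjusted figure subtracts a penalty for each of them. Neither number says anything about whether the coefficient on `x` is causal.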
