Regression Diagnostics: Checking Assumptions and Violations

Core Idea

Ordinary least squares (OLS) regression assumes linearity, homoscedasticity, independence of errors, normality of errors, and no severe multicollinearity. Real data often violate these assumptions. Diagnostic tools such as residual plots and formal tests detect the violations, and remedies such as robust standard errors correct for them.

Explainer

From linear regression, you know that OLS minimizes the sum of squared residuals to find the best-fitting line. What "best" means depends critically on a set of assumptions built into the math. When the model is correctly specified and the errors are uncorrelated with constant variance, OLS is BLUE: the Best Linear Unbiased Estimator. Normality is not required for that result; it matters for exact inference in small samples. When homoscedasticity or independence fails, your coefficient estimates may still be unbiased, but your standard errors, and therefore your p-values, confidence intervals, and hypothesis tests, can be badly wrong. When linearity fails, the coefficients themselves can be misleading. Regression diagnostics are the practice of checking which assumptions hold and deciding what to do when they don't.
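The role of the error-variance assumption can be made explicit with the standard matrix formulas. This is a sketch in notation the text itself does not use: X is the n-by-p design matrix, y the outcome vector, and Ω the covariance matrix of the errors (all symbols are assumptions introduced here for illustration).

```latex
\begin{align}
\hat{\beta} &= (X^\top X)^{-1} X^\top y \\
\operatorname{Var}(\hat{\beta} \mid X) &= (X^\top X)^{-1} X^\top \Omega X \, (X^\top X)^{-1} \\
\Omega = \sigma^2 I \;\Longrightarrow\; \operatorname{Var}(\hat{\beta} \mid X) &= \sigma^2 (X^\top X)^{-1}
\end{align}
```

The usual standard errors are computed from the last line, so they can be trusted only when the errors are homoscedastic and uncorrelated. The estimator in the first line does not involve Ω at all, which is why the coefficients can stay unbiased while the inference built on them breaks.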

The five core assumptions are worth knowing as a checklist. Linearity means the relationship between X and Y is actually linear; violations show up as systematic curves in a residuals-vs-fitted plot. Homoscedasticity means the variance of the errors is constant across all values of X; violations (called heteroscedasticity) make residual plots fan out or funnel in. Independence means the errors are not correlated across observations; it is violated by clustered data (students within schools), panel data (repeated measures), or spatial data. Normality of the errors is the least consequential assumption in large samples, where the central limit theorem makes the coefficient estimates approximately normal anyway, but it matters in small samples when you need accurate p-values. No multicollinearity means predictors aren't so highly correlated that the model can't distinguish their separate effects. It is diagnosed with the variance inflation factor (VIF), which measures how much the variance of a coefficient is inflated by that predictor's correlation with the other predictors; a high VIF signals an inflated standard error.
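The VIF mentioned above can be computed directly from its definition, VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing predictor j on the remaining predictors. The following is a minimal NumPy sketch; the function name `variance_inflation_factors` and the simulated data are hypothetical and chosen only for illustration.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return the VIF of each column of X (predictors only, no intercept column).

    VIF_j = 1 / (1 - R_j^2) = SS_total / SS_residual, where the sums of squares
    come from regressing column j on the other columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        ss_res = resid @ resid
        ss_tot = np.sum((target - target.mean()) ** 2)
        vifs[j] = ss_tot / ss_res
    return vifs

# Simulated predictors (assumption): x2 is nearly a copy of x1, x3 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)
x3 = rng.normal(size=500)
vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(vifs)
```

In this simulation the two nearly duplicate predictors get very large VIFs while the unrelated one stays close to 1. Rule-of-thumb cutoffs for "too high" vary by field, so treat any threshold as a convention, not a test.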

The primary diagnostic tool is the residual plot: a scatterplot of residuals against fitted values (or against each predictor). Patterns in this plot are informative: a random cloud means the linearity and homoscedasticity assumptions look okay; a curve suggests a nonlinear relationship you've missed; a fan shape signals heteroscedasticity. A Q-Q plot of the residuals against normal quantiles diagnoses normality: points that stray systematically from the reference line indicate non-normality. For influential observations, Cook's distance measures how much the fitted values, and with them the coefficient estimates, would change if a particular point were removed. It combines the size of a point's residual with its leverage, that is, how unusual its predictor values are; high-leverage points can disproportionately determine your results.
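Leverage and Cook's distance can be computed from the hat matrix H = X(X'X)⁻¹X'. The sketch below uses the textbook formula D_i = e_i² / (p·s²) · h_i / (1 - h_i)², with p the number of coefficients and s² the residual mean square. The function name `leverage_and_cooks_distance` and the planted outlier are hypothetical, for illustration only.

```python
import numpy as np

def leverage_and_cooks_distance(X, y):
    """Return (leverage, Cook's distance) for each observation.

    X must already contain an intercept column.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # Diagonal of the hat matrix H = X (X'X)^{-1} X', without forming H itself.
    leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    mse = resid @ resid / (n - p)
    cooks = (resid ** 2 / (p * mse)) * leverage / (1.0 - leverage) ** 2
    return leverage, cooks

# Simulated data (assumption): a clean linear trend plus one planted point
# that is far out in x and far off the line.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(size=50)
x = np.append(x, 25.0)   # unusual predictor value -> high leverage
y = np.append(y, 0.0)    # far below the trend -> large residual
X = np.column_stack([np.ones(x.size), x])

leverage, cooks = leverage_and_cooks_distance(X, y)
print("most influential index:", int(np.argmax(cooks)))
```

The planted point is the last observation (index 50), and it is the one with the largest Cook's distance because it combines high leverage with a large residual. A point with only one of the two is usually much less influential.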

The good news is that violations don't always require starting over; they often have tractable remedies. Heteroscedasticity can be addressed with robust standard errors (also called sandwich estimators or HC standard errors), which give valid inference in large samples even when variance isn't constant, without changing coefficient estimates. Nonlinearity can be addressed by adding polynomial terms, log-transforming variables, or including interaction terms. Multicollinearity between a variable and its own polynomial or interaction terms can be reduced by centering; collinearity between distinct predictors usually calls for reconsidering the model specification, for example by dropping or combining redundant predictors. Clustered observations call for cluster-robust standard errors or multilevel models. The diagnostic step tells you what's wrong; it doesn't automatically tell you the fix, which depends on understanding *why* the violation is occurring in your data. That interpretive step connects back to your knowledge of measurement validity and research design: violations often signal substantive modeling problems, not just statistical technicalities.
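To make the robust-standard-error remedy concrete, here is a minimal NumPy sketch of the sandwich estimator next to the classical one. It assumes the HC1 variant (the HC0 sandwich scaled by n / (n - p)); other HC variants differ in how they weight the squared residuals. The function name `ols_classical_and_hc1` and the simulated fan-shaped data are hypothetical.

```python
import numpy as np

def ols_classical_and_hc1(X, y):
    """Fit OLS and return (beta, classical SEs, HC1 robust SEs).

    X must already contain an intercept column.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta

    # Classical covariance: s^2 (X'X)^{-1}, valid under constant error variance.
    sigma2 = resid @ resid / (n - p)
    se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

    # Sandwich covariance: (X'X)^{-1} [X' diag(e_i^2) X] (X'X)^{-1},
    # with the HC1 small-sample factor n / (n - p).
    meat = X.T @ (X * resid[:, None] ** 2)
    cov_hc1 = (n / (n - p)) * (XtX_inv @ meat @ XtX_inv)
    se_hc1 = np.sqrt(np.diag(cov_hc1))
    return beta, se_classical, se_hc1

# Simulated data (assumption): the noise standard deviation grows with x,
# which produces the fan shape described above.
rng = np.random.default_rng(42)
n = 2000
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1 * x**2, size=n)
X = np.column_stack([np.ones(n), x])

beta, se_classical, se_hc1 = ols_classical_and_hc1(X, y)
print("coefficients:", beta)
print("classical SE:", se_classical)
print("HC1 SE:      ", se_hc1)
```

Both sets of standard errors are attached to the same coefficient vector, which is the sense in which robust standard errors change the inference but not the estimates. In this simulation the classical slope standard error is too small, so tests based on it would be overconfident.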
