A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Assumptions in Linear Regression

College · Depth 83 in the knowledge graph

199 topics build on this

328 prerequisites beneath it

Core Idea

Standard linear regression assumes: linearity (relationship is linear), independence of observations, homoscedasticity (constant error variance), and normality of errors. Violations affect validity of inferential procedures. Residual plots help diagnose violations.

How It's Best Learned

Create residual plots for various datasets and identify assumption violations. Compare behavior of regression under satisfied vs. violated assumptions. Use transformations to stabilize variance or linearize relationships.

Common Misconceptions

Assuming regression works automatically without checking assumptions. Thinking normality is most important (independence violations are often more problematic). Fitting regression to inherently nonlinear relationships and ignoring residual patterns.

Explainer

From your prerequisite on linear regression, you know how to fit a line by minimizing squared residuals and how to read off a coefficient like "for each additional year of education, income increases by $3,200." That fitting procedure always produces a line — it will always give you numbers. But the p-values, confidence intervals, and standard errors you compute alongside those coefficients rest on four assumptions that the data may or may not satisfy. Understanding assumptions is about knowing when those inferential statements are trustworthy, not about when regression "works" mechanically.
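As a small illustration of the point that least squares always returns numbers, the sketch below fits a straight line to data that are exactly quadratic. The data (y = x² on the integers 0 to 10) and the variable names are invented for this example, and NumPy's polyfit is just one convenient way to obtain the least-squares line.

```python
import numpy as np

# Hypothetical data: a perfectly quadratic relationship with no noise.
x = np.arange(11, dtype=float)   # 0, 1, ..., 10
y = x ** 2

# Least squares returns a slope and an intercept regardless of the true shape.
slope, intercept = np.polyfit(x, y, deg=1)

fitted = intercept + slope * x
residuals = y - fitted

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print("residuals:", np.round(residuals, 2))
```

The fit reports a slope of 10 and an intercept of -15, yet the residuals are positive at both ends of the x range and negative in the middle. The numbers exist, but the systematic residual pattern shows that a straight line is the wrong description of these data.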

The four assumptions are often remembered by the acronym LINE. Linearity means the expected outcome is an additive, linear function of the predictors; if the true relationship curves, a straight-line fit gives coefficients that are biased summaries of a nonlinear truth. Independence means each observation's error is unrelated to every other observation's error. It is violated in time-series data (where yesterday's error predicts today's), in clustered data (where students within the same school share unmeasured factors), and anywhere that repeated measurements come from the same unit. Independence violations are often the most damaging, yet they frequently leave no trace in a standard residual-versus-fitted plot; they tend to show up only when residuals are examined against time, observation order, or group membership. Homoscedasticity means the error variance is constant across the range of fitted values. If larger predicted values come with more widely scattered residuals (a "fan" pattern), the usual standard errors are wrong, and the error can run in either direction, overstating or understating significance. Normality of errors is the mildest assumption: by the Central Limit Theorem, the sampling distribution of the coefficient estimates is approximately normal even when the errors are not, provided the sample is reasonably large.
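In symbols, the four LINE assumptions for a model with p predictors can be written as follows. The notation (β for coefficients, ε for errors, σ² for the common error variance) is a conventional choice made for this sketch.

```latex
\begin{aligned}
&\text{Linearity:} && y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i,
  \qquad \mathbb{E}[\varepsilon_i \mid x_{i1}, \dots, x_{ip}] = 0 \\
&\text{Independence:} && \varepsilon_i \text{ is independent of } \varepsilon_j \quad \text{for } i \neq j \\
&\text{Normality:} && \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \\
&\text{Equal variance:} && \operatorname{Var}(\varepsilon_i) = \sigma^2 \quad \text{for all } i
\end{aligned}
```

Every assumption is a statement about the unobserved errors ε, not about the outcome or the predictors themselves, which is why the diagnostics work with the residuals as stand-ins for those errors.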

The primary diagnostic tool is the residual plot: a scatterplot of residuals (y-axis) against fitted values (x-axis). A well-specified model produces a cloud of points with no visible pattern: random scatter centered at zero, constant spread, no curves. Curved patterns indicate violated linearity; a fan or funnel shape indicates heteroscedasticity. Autocorrelation usually appears as runs or waves when residuals are plotted against time or observation order rather than against fitted values. A QQ-plot of the residual quantiles against theoretical normal quantiles checks the normality assumption: the points should fall close to a straight diagonal line.
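The visual checks described above have rough numeric counterparts, sketched below under stated assumptions. The three data sets are simulated for illustration only, and the helper names (fit_line, spread_ratio, qq_correlation) are hypothetical. The spread ratio mimics looking for a fan in the residual plot, and the QQ correlation mimics judging how straight a QQ-plot is. Neither replaces actually drawing the plots.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

def fit_line(x, y):
    """Return (fitted values, residuals) from a least-squares line."""
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = intercept + slope * x
    return fitted, y - fitted

def spread_ratio(fitted, residuals):
    """Residual SD in the upper half of fitted values over the lower half."""
    order = np.argsort(fitted)
    half = len(order) // 2
    low, high = residuals[order[:half]], residuals[order[half:]]
    return high.std() / low.std()

def qq_correlation(residuals):
    """Correlation between sorted residuals and standard normal quantiles."""
    n = len(residuals)
    probs = (np.arange(1, n + 1) - 0.5) / n
    normal_q = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    return np.corrcoef(np.sort(residuals), normal_q)[0, 1]

x = np.linspace(1, 10, 200)

# Hypothetical data set 1: constant-variance normal errors.
y_ok = 2 + 3 * x + rng.normal(0, 1, size=x.size)
# Hypothetical data set 2: error SD grows with x (a "fan").
y_fan = 2 + 3 * x + rng.normal(0, 0.5 * x)
# Hypothetical data set 3: strongly right-skewed errors with mean zero.
y_skew = 2 + 3 * x + (rng.exponential(1.0, size=x.size) - 1.0)

ratio_ok = spread_ratio(*fit_line(x, y_ok))
ratio_fan = spread_ratio(*fit_line(x, y_fan))
qq_ok = qq_correlation(fit_line(x, y_ok)[1])
qq_skew = qq_correlation(fit_line(x, y_skew)[1])

print(f"spread ratio, constant variance: {ratio_ok:.2f}")
print(f"spread ratio, fan pattern:       {ratio_fan:.2f}")
print(f"QQ correlation, normal errors:   {qq_ok:.3f}")
print(f"QQ correlation, skewed errors:   {qq_skew:.3f}")
```

On data like these, the spread ratio should sit near 1 for the constant-variance set and clearly above 1 for the fan, while the QQ correlation should be closer to 1 for normal errors than for skewed ones. No single cutoff is implied; the summaries only quantify what the eye looks for in the plots.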

When you find violations, you have options rather than a dead end. A curved residual pattern often calls for a transformation of a predictor (log x, x²) or the outcome (log y for multiplicative relationships). Heteroscedasticity often responds to a log transformation of y, or to using robust standard errors that remain valid without the constant-variance assumption. Autocorrelation requires modeling the error structure directly: the Durbin-Watson test detects first-order autocorrelation, and time-series methods account for it. Independence violations from clustering are addressed by clustered standard errors or mixed models. The assumptions are not pass-or-fail gates; they are diagnostic signals that tell you what model refinements are needed.
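To make the Durbin-Watson idea concrete, here is a minimal sketch that computes the statistic by hand: the sum of squared differences between successive residuals divided by the sum of squared residuals. The two simulated series, the AR(1) coefficient of 0.8, and the helper names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is repeatable

def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squares."""
    diffs = np.diff(residuals)
    return np.sum(diffs ** 2) / np.sum(residuals ** 2)

def line_residuals(t, y):
    """Residuals from a least-squares line of y on t."""
    slope, intercept = np.polyfit(t, y, deg=1)
    return y - (intercept + slope * t)

n = 300
t = np.arange(n, dtype=float)

# Hypothetical series 1: independent errors around a linear trend.
e_iid = rng.normal(0, 1, size=n)

# Hypothetical series 2: AR(1) errors, each carrying over 80% of the previous one.
rho = 0.8
shocks = rng.normal(0, 1, size=n)
e_ar = np.empty(n)
e_ar[0] = shocks[0]
for i in range(1, n):
    e_ar[i] = rho * e_ar[i - 1] + shocks[i]

dw_iid = durbin_watson(line_residuals(t, 5 + 0.1 * t + e_iid))
dw_ar = durbin_watson(line_residuals(t, 5 + 0.1 * t + e_ar))

print(f"Durbin-Watson, independent errors: {dw_iid:.2f}")
print(f"Durbin-Watson, AR(1) errors:       {dw_ar:.2f}")
```

The statistic is roughly 2(1 - r), where r is the lag-1 correlation of the residuals, so values near 2 are consistent with no first-order autocorrelation and values well below 2 point to positive autocorrelation. A formal test compares the statistic with tabulated bounds, which this sketch does not attempt.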

A practical framing: think of the four assumptions as ordered by how much damage a violation does. Nonlinearity produces biased coefficients, so the fundamental estimates are wrong. Independence violations corrupt standard errors in hard-to-predict directions. Heteroscedasticity makes the usual standard errors too large or too small but leaves the coefficients unbiased. Normality violations matter primarily for small samples and are the most forgiving. Every regression analysis should at minimum produce a residual-vs-fitted plot and a QQ-plot before any inference is reported; skipping this step is not efficiency, it risks reporting inference from a silently misspecified model.
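The claim that heteroscedasticity distorts standard errors while leaving coefficients unbiased can be checked with a small simulation. The sketch below is one way to do so under invented settings: the true coefficients, the error SD that grows with x², and the number of replications are all hypothetical. HC0 (White) standard errors are used as one common example of a heteroscedasticity-robust estimator.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed so the sketch is repeatable

n, reps = 100, 2000
true_intercept, true_slope = 1.0, 2.0

x = np.linspace(0, 10, n)
u = x - x.mean()                 # centered predictor
sxx = np.sum(u ** 2)
sigma = 0.1 * x ** 2             # hypothetical: error SD grows with x

# One simulated data set per row.
errors = rng.normal(0, 1, size=(reps, n)) * sigma
y = true_intercept + true_slope * x + errors

# Closed-form simple-regression estimates for every row at once.
slopes = y @ u / sxx
intercepts = y.mean(axis=1) - slopes * x.mean()
resid = y - intercepts[:, None] - slopes[:, None] * x

# Classical SE of the slope assumes one common error variance.
s2 = np.sum(resid ** 2, axis=1) / (n - 2)
se_classical = np.sqrt(s2 / sxx)

# HC0 robust SE weights each squared residual by (x_i - mean(x))^2.
se_robust = np.sqrt(np.sum((u ** 2) * resid ** 2, axis=1)) / sxx

# The SD of the slope estimates across replications is the benchmark.
empirical_sd = slopes.std(ddof=1)

print(f"mean slope estimate: {slopes.mean():.3f} (true value {true_slope})")
print(f"empirical SD of slope estimates: {empirical_sd:.3f}")
print(f"average classical SE:            {se_classical.mean():.3f}")
print(f"average HC0 robust SE:           {se_robust.mean():.3f}")
```

In this setup the average slope estimate should land close to the true slope, which is the unbiasedness part of the claim. The classical standard error should come out noticeably smaller than the actual spread of the estimates, with the robust version much closer to it. With a different variance pattern the classical standard error could be too large instead.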

Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Dividing Integers → Unit Rates → Proportions → Percent Concept → Converting Between Fractions, Decimals, and Percents → Operations with Rational Numbers → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → 
Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Sine, Cosine, and Tangent Ratios → Trigonometric Ratios Review → Vectors in Two Dimensions → Vector Operations: Addition, Subtraction, and Scalar Multiplication → Dot Product (Inner Product in R^n) → Inner Product Spaces → Orthogonality → Orthogonal Projections → Orthogonal Projections and Least Squares Approximation → Linear Regression and Least Squares Estimation → Introduction to Multiple Linear Regression → Assumptions in Linear Regression

Longest path: 84 steps · 328 total prerequisite topics

Prerequisites (2)

Linear Regression and Least Squares Estimation (hard)
Introduction to Multiple Linear Regression (soft)

Leads To (1)

Regression Diagnostics and Residual Analysis (hard)