A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Endogeneity

College · Depth 115 in the knowledge graph

75 topics build on this

584 prerequisites beneath it

Core Idea

Endogeneity is a general term for any situation where a regressor is correlated with the error term, making OLS biased and inconsistent. There are three main sources: omitted variable bias (a confounder is excluded from the model), simultaneity (y and x are jointly determined, as when price and quantity are both endogenous in a supply-demand system), and measurement error (x is measured with noise, which pulls its coefficient toward zero, an effect known as attenuation bias). Endogeneity is the central identification problem in applied economics, and most advanced methods (instrumental variables, panel fixed effects, regression discontinuity, difference-in-differences) are designed to address specific forms of it.

How It's Best Learned

Work through three separate examples, one for each source of endogeneity, and derive the direction of bias for each. The supply-demand simultaneity example is essential for macroeconomics applications.

Common Misconceptions

Endogeneity is not about the dependent variable being 'determined inside a model'; it specifically means Cov(xⱼ, u) ≠ 0.
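Classical measurement error in y (noise unrelated to the regressors) does not cause endogeneity: it is absorbed into the error term and only makes the estimates less precise. Measurement error in x is what creates attenuation bias.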
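The second misconception can be checked with a small simulation. This is a minimal sketch, assuming classical (independent, mean-zero) noise, a true slope of 2, and unit variances throughout; the function names `ols_slope` and `measurement_error_demo` are hypothetical helpers introduced only for illustration.

```python
import numpy as np

def ols_slope(x, y):
    """Slope from a bivariate OLS regression of y on x (with intercept)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

def measurement_error_demo(n=200_000, beta=2.0, noise_sd=1.0, seed=0):
    """Compare OLS slopes with clean data, noise added to y, and noise added to x."""
    rng = np.random.default_rng(seed)
    x_true = rng.normal(size=n)                       # correctly measured regressor
    y_true = beta * x_true + rng.normal(size=n)       # structural equation
    y_noisy = y_true + rng.normal(scale=noise_sd, size=n)  # classical error in y
    x_noisy = x_true + rng.normal(scale=noise_sd, size=n)  # classical error in x
    return {
        "clean": ols_slope(x_true, y_true),
        "noise_in_y": ols_slope(x_true, y_noisy),
        "noise_in_x": ols_slope(x_noisy, y_true),
    }

results = measurement_error_demo()
for label, slope in results.items():
    print(f"{label:>10}: {slope:.3f}")
```

Under these assumptions the slope with noise in y should stay close to the true value of 2, while the slope with noise in x should fall toward 2 × 0.5 = 1, because half of the observed regressor's variance is noise.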

Explainer

From your study of OLS assumptions, you know that the zero-conditional-mean assumption E(u|X) = 0 is what makes OLS unbiased. Endogeneity is the collective name for anything that violates this assumption — any situation where your regressor X is correlated with the error term u. When Cov(X, u) ≠ 0, OLS does not recover the true causal effect; instead it picks up a blend of the causal effect and the confounding relationship between X and u. The bias does not shrink with sample size — endogeneity is a consistency problem, not just a precision problem.
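A short simulation can illustrate why more data does not help. The sketch below assumes a data-generating process chosen purely for illustration: a true slope of 2, an error u with unit variance, and a regressor built as x = 0.5·u + independent noise, so that Cov(x, u) = 0.5 and Var(x) = 1.25. The names `ols_slope` and `simulate_endogenous_slope` are hypothetical.

```python
import numpy as np

def ols_slope(x, y):
    """Slope from a bivariate OLS regression of y on x (with intercept)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

def simulate_endogenous_slope(n, beta=2.0, rho=0.5, seed=0):
    """OLS slope when the regressor is built to be correlated with the error."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)                 # structural error
    x = rho * u + rng.normal(size=n)       # Cov(x, u) = rho, Var(x) = rho**2 + 1
    y = beta * x + u
    return ols_slope(x, y)

# Analytical limit: beta + Cov(x, u) / Var(x) = 2 + 0.5 / 1.25 = 2.4
for n in (100, 10_000, 1_000_000):
    print(n, round(simulate_endogenous_slope(n), 3))
```

As n grows, the estimate should settle near 2.4 rather than 2: the sampling noise vanishes but the gap of Cov(x, u)/Var(x) = 0.4 remains.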

The three sources each have a distinct mechanism. You already understand the first from omitted variable bias: if a variable Z affects Y and is correlated with X but is left out of the model, Z ends up in the error term, making u correlated with X. The classic example is estimating the returns to education: ability affects wages and is correlated with education, so omitting ability inflates the estimated education coefficient. The direction of bias follows a simple formula: the product of the sign of (Z's effect on Y) and the sign of (Z's correlation with X). Second, simultaneity arises when X and Y jointly determine each other. Trying to estimate a demand curve using market data is the canonical case: both price and quantity are simultaneously set by the intersection of supply and demand, so price is correlated with the demand error. Running OLS on price and quantity gives neither a supply curve nor a demand curve — it gives a jumbled blend of both.
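The simultaneity case can be sketched numerically. The example assumes a hypothetical linear market, with demand q = 10 − b·p + u_d and supply q = 2 + d·p + u_s, independent shocks, and illustrative parameter values; none of these numbers come from real data. Solving the two equations for the equilibrium price gives p = (8 + u_d − u_s)/(b + d).

```python
import numpy as np

def simulate_market(n=200_000, b=1.0, d=0.5, sd_demand=1.0, sd_supply=1.0, seed=0):
    """Equilibrium (p, q) pairs from the assumed linear demand and supply curves."""
    rng = np.random.default_rng(seed)
    u_d = rng.normal(scale=sd_demand, size=n)   # demand shock
    u_s = rng.normal(scale=sd_supply, size=n)   # supply shock
    p = (10.0 - 2.0 + u_d - u_s) / (b + d)      # price where demand equals supply
    q = 2.0 + d * p + u_s                       # quantity read off the supply curve
    return p, q

def ols_slope(x, y):
    """Slope from a bivariate OLS regression of y on x (with intercept)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

p, q = simulate_market()
print("OLS slope of q on p:", round(ols_slope(p, q), 3))
print("true demand slope: -1.0, true supply slope: 0.5")
```

In this setup the OLS slope converges to (d·Var(u_d) − b·Var(u_s)) / (Var(u_d) + Var(u_s)), which is −0.25 for the values above: a variance-weighted mix of the supply slope and the demand slope that matches neither.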

The third source, measurement error in X, is subtler but important. Suppose you're estimating the effect of true ability (X*) on wages, but you only observe test scores (X = X* + v), where v is random noise unrelated to X*. Write β for the true effect of X*. Substituting X* = X − v into the wage equation moves a −βv term into the error, so the composite error u contains v with the opposite sign to β while the observed X contains v with a positive sign. The two are therefore correlated, with Cov(X, u) = −β·Var(v). The result is attenuation bias: your estimated coefficient is biased toward zero, so you understate the true relationship. The attenuation factor equals the reliability ratio, the fraction of X's variance that is true signal rather than noise; OLS converges to β multiplied by that ratio.
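The attenuation result can be written out in a few lines. This is a sketch under the classical measurement-error assumptions: v is uncorrelated with both X* and the structural error, and all variables are taken as deviations from their means. The symbols β (true effect), ε (structural error), and λ (reliability ratio) are notation introduced here for illustration.

```latex
\begin{align*}
Y &= \beta X^{*} + \varepsilon, \qquad X = X^{*} + v \\
Y &= \beta X + u, \qquad u = \varepsilon - \beta v \\
\operatorname{Cov}(X, u) &= \operatorname{Cov}(X^{*} + v,\; \varepsilon - \beta v) = -\beta\,\sigma_{v}^{2} \\
\operatorname{plim}\,\hat{\beta}_{\mathrm{OLS}}
  &= \beta + \frac{\operatorname{Cov}(X, u)}{\operatorname{Var}(X)}
   = \beta - \beta\,\frac{\sigma_{v}^{2}}{\sigma_{X^{*}}^{2} + \sigma_{v}^{2}}
   = \beta \underbrace{\frac{\sigma_{X^{*}}^{2}}{\sigma_{X^{*}}^{2} + \sigma_{v}^{2}}}_{\lambda}
\end{align*}
```

Because 0 < λ ≤ 1, the limit keeps the sign of β but shrinks its magnitude. For instance, if the noise variance equals the signal variance, λ = 0.5 and OLS recovers only half of the true effect.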

All the major tools of applied econometrics (instrumental variables, difference-in-differences, regression discontinuity, panel fixed effects) exist specifically to address one or more forms of endogeneity. Instrumental variables estimation finds an instrument, a variable that shifts X but has no direct effect on Y and no correlation with the error, allowing you to use only the exogenous variation in X for identification. Panel fixed effects remove time-invariant omitted variables by subtracting each unit's own time average from every variable. Understanding endogeneity is therefore not an isolated topic: it is the central diagnostic question behind every causal regression design. Before trusting any OLS estimate, ask: is there any reason my regressor might be correlated with the error? If yes, identify the source and select the appropriate remedy.
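The instrumental-variables idea can be sketched with the simplest just-identified estimator, Cov(z, y) / Cov(z, x). The simulated data below are an assumption made for illustration: a true slope of 2, an instrument z that is independent of the error by construction, and a regressor x that depends on both z and the error. The function name `simulate_iv` is hypothetical.

```python
import numpy as np

def simulate_iv(n=200_000, beta=2.0, seed=0):
    """Return (OLS slope, IV slope) for a regressor that is correlated with the error."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)                        # instrument: shifts x, unrelated to u
    u = rng.normal(size=n)                        # structural error
    x = 1.0 * z + 0.5 * u + rng.normal(size=n)    # endogenous: x depends on u
    y = beta * x + u                              # z affects y only through x
    ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]  # simple IV (Wald-type) estimator
    return ols, iv

ols, iv = simulate_iv()
print(f"OLS: {ols:.3f}   IV: {iv:.3f}   true beta: 2.000")
```

Here OLS converges to 2 + 0.5/2.25 ≈ 2.22, while the IV estimate should land close to 2 because it uses only the part of the variation in x that is driven by z. In real applications the exclusion restriction cannot be imposed by construction as it is here; it has to be argued.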

Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Dividing Integers → Unit Rates → Proportions → Percent Concept → Converting Between Fractions, Decimals, and Percents → Operations with Rational Numbers → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → 
Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Sine, Cosine, and Tangent Ratios → Trigonometric Ratios Review → Radian Measure → Converting Between Degrees and Radians → The Unit Circle → Graphing Sine and Cosine → Graphing Tangent and Reciprocal Trigonometric Functions → Derivatives of Trigonometric Functions → Antiderivatives → Indefinite Integrals → Basic Integration Rules → Riemann Sums → Definite Integral Definition → Probability Density Functions and Continuous Distributions → Cumulative Distribution Functions → Continuous Random Variables → Probability Density Functions → Expected Value → Weak Law of Large Numbers → Probability Axioms and Rules → Conditional Probability → Independence of Events → Sampling Distributions → Standard Error of Estimators → Hypothesis Testing: Framework and Logic → P-values and Statistical Significance → Effect Size and Practical Significance → Hypothesis Testing: Framework and Logic → Z-Tests and T-Tests for Means → One-Sample Z-Test for Means → One-Sample and Two-Sample T-Tests → Inference in Linear Regression → Prediction Intervals in Regression → Linear Regression Basics → Residuals and Goodness of Fit (R²) → Simple (Bivariate) OLS Regression → Classical OLS Assumptions (Gauss-Markov) → Multiple Regression → Interpreting Regression Coefficients → Hypothesis Testing in Regression → F-Test and Joint Significance → R-Squared and Model Fit → Omitted Variable Bias → Endogeneity

Longest path: 116 steps · 584 total prerequisite topics

Prerequisites (2)

Omitted Variable Bias (hard) · Classical OLS Assumptions (Gauss-Markov) (hard)

Leads To (4)

Endogenous Regressors: Bias and Consequences (hard) · Instrumental Variables (hard) · Panel Data: Structure and Advantages (hard) · Selection Bias (hard)