A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Dummy Variables and Categorical Regressors

College · Depth 111 in the knowledge graph

64 topics build on this

577 prerequisites beneath it

Core Idea

A dummy (indicator) variable takes the value 0 or 1 to represent group membership, allowing categorical variables to enter a linear regression. The coefficient on a dummy captures the mean difference in y between that group and the omitted reference group, holding all other regressors constant. For a variable with k categories in a model that has an intercept, include k−1 dummies to avoid perfect multicollinearity (the dummy variable trap). Interaction terms between a dummy and a continuous variable allow the slope on the continuous variable to differ across groups, which makes it possible to test whether a relationship is heterogeneous.

How It's Best Learned

Run a gender wage gap regression with and without control variables to see how the dummy coefficient changes — this illustrates both interpretation and the role of controls in reducing omitted variable bias.
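The exercise above can be sketched on simulated data. This is a minimal illustration, not a real wage study: the data-generating values (a gap of −2 at equal education and experience, and women given 4 fewer years of experience on average) are assumptions chosen so that leaving out the controls biases the dummy coefficient. The helper `ols` is a hypothetical convenience function.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulated (not real) data: women are assigned less experience on average,
# so the raw gap mixes the assumed gap of -2 with an experience effect.
female = rng.integers(0, 2, size=n)
exper = 10 + rng.normal(0, 3, size=n) - 4 * female
educ = 12 + rng.normal(0, 2, size=n)
wage = 5 + 1.0 * educ + 0.5 * exper - 2.0 * female + rng.normal(0, 1, size=n)

def ols(y, *columns):
    """Return OLS coefficients for y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

raw_gap = ols(wage, female)[1]               # dummy coefficient, no controls
adj_gap = ols(wage, female, educ, exper)[1]  # dummy coefficient, with controls

print(f"gap without controls: {raw_gap:.2f}")
print(f"gap with controls:    {adj_gap:.2f}")
```

Under these assumptions the uncontrolled coefficient should come out more negative than the controlled one, because it also absorbs the effect of the omitted experience variable.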

Common Misconceptions

Including all k dummies alongside an intercept creates perfect multicollinearity (the dummy variable trap). Drop one dummy, or less commonly drop the intercept.
The reference category matters for the coefficient values but not for the implied predicted means or their differences.

Explainer

A dummy variable (also called an indicator variable) is a 0/1 switch that lets categorical information enter a regression. Suppose you want to know whether women earn less than men, controlling for education and experience. You can't use "gender" directly as a number — it has no natural scale. Instead, you create a variable that equals 1 if female and 0 if male. The OLS regression then estimates a separate intercept shift for the female group: the coefficient on the dummy tells you the average wage gap, holding education and experience constant. This is the core insight — the dummy converts a group membership question into a coefficient interpretation you already know from multiple regression.
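Written out, the wage example looks like the following. The symbols (β, δ, u) are notation chosen here for illustration, and the second line assumes the error term has conditional mean zero given the regressors.

```latex
\begin{aligned}
\text{wage}_i &= \beta_0 + \delta\,\text{female}_i + \beta_1\,\text{educ}_i + \beta_2\,\text{exper}_i + u_i \\
\delta &= E[\text{wage} \mid \text{female}=1,\ \text{educ},\ \text{exper}]
        - E[\text{wage} \mid \text{female}=0,\ \text{educ},\ \text{exper}]
\end{aligned}
```

Setting female to 0 gives an intercept of β₀ for men; setting it to 1 gives β₀ + δ for women, which is the intercept shift described above.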

The reference category is the group coded 0 on every dummy, and each dummy coefficient is interpreted as a difference *relative to that baseline*. If you have three education categories (high school, college, and graduate), you'd include two dummies and leave one out. The omitted category becomes the reference. A coefficient of +$15,000 on the college dummy means college graduates earn $15,000 more on average than high school graduates (the reference), holding other variables constant. You could make any category the reference; the predicted values and group differences don't change. Only the coefficients change, because each one is re-expressed relative to the new baseline.
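A small simulation can make the point about switching the reference category concrete. The group means below (40, 55, 70, in thousands of dollars) and the helper `fit` are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
levels = ["high_school", "college", "graduate"]
code = rng.integers(0, 3, size=300)          # index into `levels`
group_mean = np.array([40.0, 55.0, 70.0])    # assumed means, in $1,000s
wage = group_mean[code] + rng.normal(0, 5, size=300)

def fit(reference):
    """OLS of wage on an intercept plus a dummy for every level except `reference`."""
    kept = [j for j, name in enumerate(levels) if name != reference]
    dummies = [(code == j).astype(float) for j in kept]
    X = np.column_stack([np.ones(len(wage))] + dummies)
    beta, *_ = np.linalg.lstsq(X, wage, rcond=None)
    names = ["intercept"] + [levels[j] for j in kept]
    return dict(zip(names, beta)), X @ beta

coef_hs, fitted_hs = fit("high_school")   # high school as reference
coef_col, fitted_col = fit("college")     # college as reference

print(coef_hs)
print(coef_col)
print(np.allclose(fitted_hs, fitted_col))  # same fitted values either way
```

With high school as the reference, the intercept is the high school sample mean and the college coefficient is the college-minus-high-school gap. With college as the reference, the high school coefficient is that same gap with the sign flipped.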

The dummy variable trap is what happens when you include all k dummies for a k-category variable in a model that also has an intercept. Each observation belongs to exactly one category, so the dummies always sum to 1, which is exactly the intercept column. This perfect multicollinearity means the matrix (X'X) cannot be inverted, and OLS has no unique solution. The fix is mechanical: omit one category (or, less commonly, drop the intercept and keep all k dummies). This is not a problem with the data; it follows from how the model is specified. Software packages typically drop a category automatically, but you should know which one was dropped to interpret the coefficients correctly.
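The rank deficiency can be checked directly. The sketch below builds design matrices from a simulated three-category variable (the data are made up) and compares the number of columns with the matrix rank.

```python
import numpy as np

rng = np.random.default_rng(2)
code = rng.integers(0, 3, size=50)                        # three categories, simulated
dummies = (code[:, None] == np.arange(3)).astype(float)   # all k = 3 dummies
intercept = np.ones((50, 1))

X_trap = np.hstack([intercept, dummies])         # intercept + all 3 dummies
X_ok = np.hstack([intercept, dummies[:, 1:]])    # intercept + 2 dummies
X_no_const = dummies                             # all 3 dummies, no intercept

# The dummies sum to the intercept column, row by row.
print(np.allclose(dummies.sum(axis=1), 1.0))

for name, X in [("trap", X_trap), ("drop one", X_ok), ("no intercept", X_no_const)]:
    print(name, "columns:", X.shape[1], "rank:", np.linalg.matrix_rank(X))
```

Only the first design should have a rank below its column count, which is what makes (X'X) singular. The other two are full rank and give the same fitted values, just parameterized differently.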

Interaction terms extend dummy variables into testing whether the *slope* of a continuous variable differs across groups. If you interact the female dummy with years of education, the interaction coefficient estimates how much the return to education differs for women versus men. A positive interaction means education pays off more for women; a negative one means less. Without the interaction, your model assumes the slope on education is identical for both groups, and the dummy only shifts the intercept. With the interaction, you allow the line itself to have a different angle. Note that once the interaction is included, the coefficient on the dummy alone is the gap at zero years of education, not the average gap. This is a powerful generalization: the dummy captures level differences, while the interaction term captures heterogeneous relationships, enabling you to test whether a relationship you've estimated is the same across groups.
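As a rough sketch of the interaction model, the simulation below assumes a return to education of 1.0 for men and 1.5 for women (illustrative values only) and recovers the two slopes from a single regression. The helper `slope` is hypothetical and is used only as a cross-check.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
female = rng.integers(0, 2, size=n).astype(float)
educ = rng.uniform(8, 20, size=n)
# Simulated wages: slope on education is 1.0 for men, 1.0 + 0.5 for women.
wage = 4 + 1.0 * educ - 3.0 * female + 0.5 * female * educ + rng.normal(0, 1, size=n)

# Columns: intercept, educ, female dummy, female x educ interaction.
X = np.column_stack([np.ones(n), educ, female, female * educ])
b0, b_educ, b_female, b_inter = np.linalg.lstsq(X, wage, rcond=None)[0]

slope_men = b_educ               # slope when female = 0
slope_women = b_educ + b_inter   # slope when female = 1

def slope(mask):
    """Slope from a simple regression of wage on educ within one group."""
    Xg = np.column_stack([np.ones(mask.sum()), educ[mask]])
    return np.linalg.lstsq(Xg, wage[mask], rcond=None)[0][1]

print(slope_men, slope(female == 0))
print(slope_women, slope(female == 1))
print("gap at educ = 0:", b_female, "gap at educ = 16:", b_female + 16 * b_inter)
```

Because the model includes the dummy, education, and their interaction, it reproduces the slopes you would get from fitting each group separately. The last line shows that the estimated gap depends on the education level at which it is evaluated.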

Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Dividing Integers → Unit Rates → Proportions → Percent Concept → Converting Between Fractions, Decimals, and Percents → Operations with Rational Numbers → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → 
Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Sine, Cosine, and Tangent Ratios → Trigonometric Ratios Review → Radian Measure → Converting Between Degrees and Radians → The Unit Circle → Graphing Sine and Cosine → Graphing Tangent and Reciprocal Trigonometric Functions → Derivatives of Trigonometric Functions → Antiderivatives → Indefinite Integrals → Basic Integration Rules → Riemann Sums → Definite Integral Definition → Probability Density Functions and Continuous Distributions → Cumulative Distribution Functions → Continuous Random Variables → Probability Density Functions → Expected Value → Weak Law of Large Numbers → Probability Axioms and Rules → Conditional Probability → Independence of Events → Sampling Distributions → Standard Error of Estimators → Hypothesis Testing: Framework and Logic → P-values and Statistical Significance → Effect Size and Practical Significance → Hypothesis Testing: Framework and Logic → Z-Tests and T-Tests for Means → One-Sample Z-Test for Means → One-Sample and Two-Sample T-Tests → Inference in Linear Regression → Prediction Intervals in Regression → Linear Regression Basics → Residuals and Goodness of Fit (R²) → Simple (Bivariate) OLS Regression → Classical OLS Assumptions (Gauss-Markov) → Multiple Regression → Interpreting Regression Coefficients → Dummy Variables and Categorical Regressors

Longest path: 112 steps · 577 total prerequisite topics

Prerequisites (2)

Interpreting Regression Coefficients (hard) · Multiple Regression (hard)

Leads To (2)

Difference-in-Differences (hard) · Fixed Effects Models (hard)