Dummy Variables and Categorical Regressors

College · Depth 77 in the knowledge graph
Unlocks 33 downstream topics
dummy-variables categorical indicator interaction-terms

Core Idea

A dummy (indicator) variable takes the value 0 or 1 to represent group membership, which lets categorical variables enter a linear regression. The coefficient on a dummy captures the mean difference in y between that group and the omitted reference group, holding all other regressors constant. For a variable with k categories in a model with an intercept, include k−1 dummies to avoid perfect multicollinearity (the dummy variable trap). Interacting a dummy with a continuous variable allows the slope on the continuous variable to differ across groups, which makes it possible to test whether a relationship is heterogeneous.

How It's Best Learned

Run a gender wage gap regression with and without control variables to see how the dummy coefficient changes — this illustrates both interpretation and the role of controls in reducing omitted variable bias.
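
A minimal sketch of that exercise, assuming NumPy is available. The data are made up and noise-free (wage = 10 + 2·educ − 3·female, with the women in the sample having less education), so the numbers are illustrative only and all variable names are hypothetical.

```python
import numpy as np

# Hypothetical, noise-free data: wage = 10 + 2*educ - 3*female (made-up numbers).
# The women in this toy sample have less education on average than the men.
educ = np.array([12, 14, 16, 18, 10, 12, 14, 16], dtype=float)
female = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
wage = 10 + 2 * educ - 3 * female

ones = np.ones_like(wage)

# Short regression: wage on an intercept and the female dummy only.
X_short = np.column_stack([ones, female])
b_short, *_ = np.linalg.lstsq(X_short, wage, rcond=None)

# Long regression: add education as a control.
X_long = np.column_stack([ones, female, educ])
b_long, *_ = np.linalg.lstsq(X_long, wage, rcond=None)

gap_without_controls = b_short[1]  # raw difference in mean wages
gap_with_controls = b_long[1]      # gap at equal education

print(f"gap without controls: {gap_without_controls:.2f}")
print(f"gap with education controlled: {gap_with_controls:.2f}")
```

In this toy sample the raw gap is 7 units against women, while the gap at equal education is 3 units. The short regression attributes the women's lower education to the dummy, which is the omitted variable bias the exercise is meant to expose.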

Explainer

A dummy variable (also called an indicator variable) is a 0/1 switch that lets categorical information enter a regression. Suppose you want to know whether women earn less than men, controlling for education and experience. You can't use "gender" directly as a number — it has no natural scale. Instead, you create a variable that equals 1 if female and 0 if male. The OLS regression then estimates a separate intercept shift for the female group: the coefficient on the dummy tells you the average wage gap, holding education and experience constant. This is the core insight — the dummy converts a group membership question into a coefficient interpretation you already know from multiple regression.
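
In the simplest case with no controls, the dummy coefficient is just the difference in group means, and the intercept is the mean of the reference group. The sketch below checks this in plain Python on six made-up observations; the labels and wages are hypothetical.

```python
# Hypothetical sample: gender labels and wages (made-up numbers).
genders = ["male", "female", "female", "male", "female", "male"]
wages = [40.0, 33.0, 35.0, 44.0, 31.0, 42.0]

# Encode the category as a 0/1 dummy: 1 if female, 0 if male (the reference).
female = [1 if g == "female" else 0 for g in genders]

n = len(wages)
mean_x = sum(female) / n
mean_y = sum(wages) / n

# Simple-regression OLS formulas: slope = S_xy / S_xx, intercept = ybar - slope * xbar.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(female, wages))
s_xx = sum((x - mean_x) ** 2 for x in female)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Group means computed directly, for comparison.
mean_female = sum(y for x, y in zip(female, wages) if x == 1) / sum(female)
mean_male = sum(y for x, y in zip(female, wages) if x == 0) / (n - sum(female))

print(slope, mean_female - mean_male)  # dummy coefficient vs. difference in means
print(intercept, mean_male)            # intercept vs. reference-group mean
```

Once controls such as education and experience are added, the coefficient is no longer the raw difference in means but the difference at equal values of those controls.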

The reference category is the omitted group, the one for which every dummy equals 0, and each dummy coefficient is interpreted as a difference *relative to that baseline*. If you have three education categories (high school, college, and graduate), you'd include two dummies, say college and graduate, and leave high school out. The omitted category becomes the reference. A coefficient of +$15,000 on the college dummy means college graduates earn $15,000 more on average than high school graduates (the reference), holding other variables constant. You could make any category the reference; the predicted values and the implied differences between groups don't change, but the coefficients are re-expressed relative to the new baseline.
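
The effect of switching the reference category can be sketched as follows, assuming NumPy and a made-up sample of six wages (in thousands of dollars). The helper `fit` and the category labels are hypothetical names chosen for illustration.

```python
import numpy as np

# Hypothetical sample: education level and wage in thousands of dollars.
levels = np.array(["hs", "hs", "college", "college", "grad", "grad"])
wage = np.array([30.0, 34.0, 45.0, 49.0, 58.0, 62.0])
categories = ["hs", "college", "grad"]

def fit(reference):
    """OLS of wage on an intercept and dummies for every category except `reference`."""
    others = [c for c in categories if c != reference]
    columns = [np.ones(len(wage))] + [(levels == c).astype(float) for c in others]
    X = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(X, wage, rcond=None)
    coefs = dict(zip(["intercept"] + others, beta))
    return coefs, X @ beta

coef_hs, fitted_hs = fit("hs")                 # high school as the baseline
coef_college, fitted_college = fit("college")  # college as the baseline

print(coef_hs)       # college and grad measured against high school
print(coef_college)  # hs and grad measured against college
print(np.allclose(fitted_hs, fitted_college))  # fitted values are identical
```

With high school as the baseline the college coefficient is +15; with college as the baseline the high school coefficient is −15. The fitted values are the same in both runs.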

The dummy variable trap is what happens when you include all k dummies for a k-category variable alongside an intercept. Each observation belongs to exactly one category, so the dummies always sum to 1, which is exactly the intercept column. This perfect multicollinearity means the matrix X'X cannot be inverted, and OLS has no unique solution. The fix is mechanical: omit one category (or, less commonly, keep all k dummies and drop the intercept). This is not a data problem; it's a modeling rule. Software packages typically drop a category automatically, but you should know which one was dropped to interpret coefficients correctly.
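
The rank deficiency behind the trap can be checked numerically. This is a small sketch assuming NumPy, with hypothetical education labels for six people.

```python
import numpy as np

# Hypothetical education labels for six people.
levels = np.array(["hs", "hs", "college", "college", "grad", "grad"])
categories = ["hs", "college", "grad"]

ones = np.ones(len(levels))
# One 0/1 column per category.
dummies = np.column_stack([(levels == c).astype(float) for c in categories])

# Every row has exactly one 1, so the dummy columns add up to the intercept column.
sums_to_intercept = np.array_equal(dummies.sum(axis=1), ones)

# Trap: intercept plus all k = 3 dummies gives 4 columns but only rank 3.
X_trap = np.column_stack([ones, dummies])
rank_trap = np.linalg.matrix_rank(X_trap)

# Fix: omit the first category ("hs"), leaving 3 columns of rank 3.
X_ok = np.column_stack([ones, dummies[:, 1:]])
rank_ok = np.linalg.matrix_rank(X_ok)

print(sums_to_intercept)
print(X_trap.shape[1], rank_trap)  # more columns than rank: X'X is singular
print(X_ok.shape[1], rank_ok)      # columns equal rank: X'X is invertible
```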

Interaction terms extend dummy variables to test whether the *slope* on a continuous variable differs across groups. If you interact the female dummy with years of education, the interaction coefficient estimates how much the return to education differs for women versus men. A positive interaction means education pays off more for women; a negative one means less. Without the interaction, your model assumes the slope on education is identical for both groups, and the dummy only shifts the intercept. With the interaction, you allow the line itself to have a different angle, and the dummy coefficient becomes the gap at zero years of education rather than at every education level. This is a powerful generalization: the dummy captures level differences, while the interaction term captures differences in slope, so a test of the interaction coefficient tells you whether the relationship is the same across groups.
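
One way to see what the interaction coefficient measures is to compare the interacted regression with separate regressions run on each group. The sketch below assumes NumPy and uses eight made-up observations; all names and numbers are hypothetical.

```python
import numpy as np

# Hypothetical sample: four men followed by four women (made-up numbers).
educ = np.array([10, 12, 14, 16, 10, 12, 14, 16], dtype=float)
female = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
wage = np.array([30, 35, 38, 43, 26, 32, 37, 44], dtype=float)

# Interacted model: wage = b0 + b1*educ + b2*female + b3*(female*educ).
X = np.column_stack([np.ones_like(wage), educ, female, female * educ])
beta, *_ = np.linalg.lstsq(X, wage, rcond=None)
b0, b1, b2, b3 = beta

# Separate straight-line fits for each group; polyfit returns [slope, intercept].
slope_men, _ = np.polyfit(educ[female == 0], wage[female == 0], 1)
slope_women, _ = np.polyfit(educ[female == 1], wage[female == 1], 1)

print(b1, slope_men)         # slope on educ is the men's return to education
print(b1 + b3, slope_women)  # adding the interaction gives the women's return
print(b3)                    # difference in slopes between the groups
```

In this toy sample the interaction coefficient is positive, so education pays off more for the women. The coefficient b2 on the dummy is the gap at zero years of education, not the gap at typical education levels.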
