Questions: Dummy Variables and Categorical Regressors
5 questions to test your understanding
Question 1 Multiple Choice
You are estimating a wage regression with a 4-category variable: season of birth (spring, summer, fall, winter). How many dummy variables should you include, and why?
A. 4 dummies: one for each season, to capture each season's full effect
B. 3 dummies: include all but one, which becomes the reference category
C. 2 dummies: one for each pair of seasons (spring/summer vs. fall/winter)
D. 1 dummy: a single variable can encode all 4 categories using values 0, 1, 2, 3
Answer: B
For a k-category variable in a model with an intercept, include k−1 = 3 dummies. Including all 4 creates perfect multicollinearity: the four dummy columns sum to 1 in every row, exactly matching the intercept column, so X'X is singular and OLS has no unique solution. This is the dummy variable trap. The omitted season becomes the reference category; each included dummy's coefficient measures the wage difference from that reference. A single numeric variable (0, 1, 2, 3) would incorrectly impose an ordering and equal spacing between seasons.
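The rank argument can be checked numerically. The following is a minimal sketch using NumPy on made-up data (the `season` codes and the sample of 8 workers are assumptions for illustration only): it builds the design matrix with all four season dummies, then with one dropped, and compares the number of columns to the matrix rank.

```python
import numpy as np

# Hypothetical data: season of birth coded 0..3 for 8 workers.
season = np.array([0, 1, 2, 3, 0, 1, 2, 3])

# One indicator column per season (8 x 4); each row contains a single 1.
dummies = np.eye(4)[season]
intercept = np.ones((len(season), 1))

# All four dummies plus an intercept: 5 columns.
X_trap = np.hstack([intercept, dummies])
# Drop the first season (the reference category): 4 columns.
X_ok = np.hstack([intercept, dummies[:, 1:]])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)

print(X_trap.shape[1], rank_trap)  # 5 4  -> rank deficient
print(X_ok.shape[1], rank_ok)      # 4 4  -> full column rank
```

A design matrix has a unique OLS solution only when its rank equals its number of columns, which holds here only after one dummy is dropped.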
Question 2 Multiple Choice
A wage regression includes a female dummy D (1=female, 0=male) and years of education. The interaction term D × Education has a coefficient of +$800. What does this mean?
A. Women earn $800 more than men on average, regardless of education
B. Each additional year of education is worth $800 for both men and women
C. Each additional year of education is worth $800 more for women than for men
D. The gender wage gap closes by $800 for each year of education women complete
Answer: C
The interaction term D × Education allows the slope on education to differ by gender. Its coefficient (+$800) means that for women (D=1), each additional year of education raises wages by $800 more than it does for men (D=0). The coefficient on D alone still captures the intercept difference (the gender gap at zero years of education), while the interaction captures the difference in slopes. Without the interaction, the model would force the return to education to be identical for both groups.
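To make the slope interpretation concrete, here is a small simulation sketch. All numbers other than the $800 interaction coefficient are hypothetical, and the wages are generated without noise so that OLS recovers the chosen parameters up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
female = rng.integers(0, 2, size=n)   # D: 1 = female, 0 = male
educ = rng.integers(8, 21, size=n)    # years of education, 8..20

# Hypothetical parameters, chosen only for illustration.
b0, b_female, b_educ, b_inter = 20000.0, -3000.0, 2000.0, 800.0
wage = b0 + b_female * female + b_educ * educ + b_inter * female * educ

# Columns: intercept, D, Education, D x Education.
X = np.column_stack([np.ones(n), female, educ, female * educ])
coef, *_ = np.linalg.lstsq(X, wage, rcond=None)

slope_men = coef[2]               # return to education when D = 0
slope_women = coef[2] + coef[3]   # return to education when D = 1
print(round(slope_men), round(slope_women))
```

The difference between the two slopes is the interaction coefficient; the coefficient on D by itself only shifts the intercept.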
Question 3 True / False
Including all k dummy variables for a k-category variable alongside an intercept term is fine in OLS regression, as long as your software is modern enough to handle the multicollinearity.
T. True
F. False
Answer: False
This is the dummy variable trap. The k dummies sum to 1 for every observation, which makes their sum identical to the intercept column (a column of 1s). This is perfect, not merely high, multicollinearity: the matrix X'X is singular and cannot be inverted, so OLS has no unique solution. This is not a computational problem that better software can overcome; the coefficients are mathematically not identified. The standard fix is to drop one category (dropping the intercept instead also works). Many software packages drop a category automatically, which estimates a different, identified model rather than the one you specified, so understanding the underlying reason is essential for knowing which category was omitted and interpreting the coefficients correctly.
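"No unique solution" can be seen directly: in the trapped model, different coefficient vectors produce exactly the same fitted values. The sketch below uses a hypothetical 3-category variable and arbitrary coefficient values.

```python
import numpy as np

# Hypothetical 3-category example with all 3 dummies plus an intercept.
cat = np.array([0, 1, 2, 0, 1, 2])
X = np.hstack([np.ones((6, 1)), np.eye(3)[cat]])

beta_a = np.array([10.0, 0.0, 2.0, 5.0])
# Raise the intercept by c and lower every dummy coefficient by c.
c = 7.0
beta_b = beta_a + c * np.array([1.0, -1.0, -1.0, -1.0])

fitted_a = X @ beta_a
fitted_b = X @ beta_b
same_fit = np.allclose(fitted_a, fitted_b)
print(same_fit)  # True: the data cannot distinguish beta_a from beta_b
```

Because any value of `c` gives the same fit, no amount of computing power can pick out a single coefficient vector.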
Question 4 True / False
Changing the reference category in a dummy variable regression changes the fitted values and the predicted mean outcomes for each group.
T. True
F. False
Answer: False
The choice of reference category is arbitrary: it changes how the coefficients are expressed, not the model's predictions or implied group means. Switching from 'high school' to 'college' as the reference changes the intercept and which group the other coefficients are measured against, but the predicted wage for every individual and the implied mean for every group remain identical. The differences between any two groups are exactly the same regardless of reference category; only the baseline from which differences are expressed changes. This is why the reference choice is a matter of parameterization, not a substantive modeling decision.
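This invariance is easy to verify. The sketch below fits the same simulated wages twice, once with each of two reference categories; the three education groups, their mean wages, and the helper `fit` are hypothetical names and values introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical groups: 0 = high school, 1 = college, 2 = graduate (30 each).
group = np.repeat([0, 1, 2], 30)
wage = np.array([30.0, 45.0, 60.0])[group] + rng.normal(0, 5, size=90)

dummies = np.eye(3)[group]
ones = np.ones((90, 1))

def fit(reference):
    """OLS with an intercept and dummies for every group except `reference`."""
    keep = [g for g in range(3) if g != reference]
    X = np.hstack([ones, dummies[:, keep]])
    coef, *_ = np.linalg.lstsq(X, wage, rcond=None)
    return coef, X @ coef

coef_hs, fitted_hs = fit(reference=0)    # high school as reference
coef_col, fitted_col = fit(reference=1)  # college as reference

same_fitted = np.allclose(fitted_hs, fitted_col)
same_coefs = np.allclose(coef_hs, coef_col)
print(same_fitted, same_coefs)
```

The fitted values agree while the coefficient vectors differ: with high school as the reference the intercept is the high school mean, and with college as the reference that same mean is the intercept plus the high school dummy's coefficient.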
Question 5 Short Answer
What is the 'dummy variable trap,' and why does it make OLS estimation mathematically impossible?
Model answer: The dummy variable trap occurs when all k dummies for a k-category variable are included along with an intercept. Because every observation belongs to exactly one category, the dummy values sum to 1 in every row, which is identical to the intercept column. This creates perfect multicollinearity: one column of the design matrix X is an exact linear combination of others. The matrix X'X becomes singular (non-invertible), so the OLS formula β̂ = (X'X)⁻¹X'y cannot be evaluated, and infinitely many coefficient vectors fit the data equally well. The fix is to omit one category, making it the reference group against which all others are measured.
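The singularity claim in the model answer can be written out in a few lines. This is a sketch in notation that is assumed here rather than taken from the quiz: ι for the intercept column, d_1, ..., d_k for the dummy columns, and Z for any other regressors.

```latex
% \iota is the n-by-1 intercept column; d_1,...,d_k are the dummy columns;
% Z collects any other regressors (possibly none).
X = \begin{bmatrix} \iota & d_1 & \cdots & d_k & Z \end{bmatrix},
\qquad d_1 + \cdots + d_k = \iota .

% Choose the nonzero vector c = (1, -1, ..., -1, 0, ..., 0)'.
Xc = \iota - (d_1 + \cdots + d_k) = 0
\;\Longrightarrow\; (X'X)c = X'(Xc) = 0, \quad c \neq 0 .

% A matrix that maps a nonzero vector to zero is singular,
% so (X'X)^{-1} does not exist.
```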
Understanding the dummy variable trap matters because it explains the k−1 rule at a deeper level than 'just drop one.' It also helps with more complex settings: any time a set of variables must sum to a constant (like shares that sum to 1, or time dummies plus a constant), the same issue arises. Recognizing perfect multicollinearity as a structural feature of the design — not a data quality problem — lets you diagnose and fix it correctly.