Questions: Differential Item Functioning and Test Bias Detection
5 questions to test your understanding
Question 1 Multiple Choice
A math test item is answered correctly by 72% of male examinees and 54% of female examinees. A researcher concludes the item shows DIF. What is wrong with this reasoning?
A. DIF can only be detected using IRT, not raw score comparisons, so the method is invalid
B. DIF requires showing that the group performance difference persists after conditioning on ability; the score gap alone could reflect genuine group differences in the construct, not item-specific bias
C. The score gap is too small to constitute DIF; a gap of at least 25 percentage points is needed
D. DIF analysis requires the two groups to be matched in sample size before comparison
Answer: B
Treating a raw score gap as DIF is the central conceptual error in DIF analysis. A raw gap between groups is not, by itself, evidence of DIF, because it could simply reflect real differences in the underlying construct (here, math ability). DIF requires showing that, at *matched* ability levels, the item still behaves differently across groups. Without conditioning on ability, you cannot separate legitimate construct differences from item-specific bias. The conditioning step is what defines DIF.
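The point can be illustrated with a minimal sketch. It assumes a simple logistic item response function and two made-up ability mixes (`group_1`, `group_2`, and all numbers are hypothetical, not taken from the question). The item has the same difficulty for both groups, so there is no DIF by construction, yet the raw proportions correct still differ.

```python
import math

def p_correct(theta, difficulty):
    """Logistic item response function: P(correct | ability theta)."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

# Hypothetical ability mixes: {ability level: share of the group}.
ability_mix = {
    "group_1": {-1.0: 0.2, 0.0: 0.3, 1.0: 0.5},
    "group_2": {-1.0: 0.5, 0.0: 0.3, 1.0: 0.2},
}
# The item has the same difficulty for both groups, so there is no DIF.
item_difficulty = {"group_1": 0.0, "group_2": 0.0}

def raw_rate(group):
    """Raw proportion correct, averaged over the group's ability mix."""
    return sum(share * p_correct(theta, item_difficulty[group])
               for theta, share in ability_mix[group].items())

def matched_gap(theta):
    """Difference in P(correct) between groups at one ability level."""
    return (p_correct(theta, item_difficulty["group_1"])
            - p_correct(theta, item_difficulty["group_2"]))

raw_gap = raw_rate("group_1") - raw_rate("group_2")
matched_gaps = [matched_gap(theta) for theta in (-1.0, 0.0, 1.0)]

print(f"raw gap: {raw_gap:.3f}")
print("gaps at matched ability:", matched_gaps)
```

The raw gap is clearly positive, while the gap at every matched ability level is zero. The raw gap here comes entirely from the different ability mixes, which is why it cannot stand in for a DIF analysis.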
Question 2 Multiple Choice
A test of English language proficiency includes an item that shows DIF against non-native speakers. Content reviewers find that the item uses a grammatical construction that is genuinely difficult for non-native speakers at any given proficiency level because it targets a specific feature of advanced English grammar. How should this DIF be classified?
A. As bias requiring immediate removal: any DIF against a minority group is by definition biased
B. Potentially as legitimate DIF: the differential functioning may reflect the target construct itself rather than irrelevant content
C. As negligible: DIF only matters when it affects groups by more than one standard deviation
D. As an IRT calibration error requiring the item to be recalibrated using the non-native speaker subsample
Answer: B
DIF is a statistical finding, not automatic evidence of bias. The DIF here could be legitimate if the grammatical feature being tested is genuinely part of English language proficiency. In that case, non-native speakers at the same overall proficiency level might genuinely differ on this specific aspect of the construct. Bias requires that the DIF arise from *construct-irrelevant* content. Content expert review is the essential next step: DIF flags items for investigation; it does not determine by itself whether the differential functioning is appropriate or problematic.
Question 3 True / False
A group scoring significantly lower on an overall test than another group provides sufficient statistical evidence that specific items in the test show DIF against the lower-scoring group.
T. True
F. False
Answer: False
Overall group score differences and DIF are logically independent. A lower-scoring group might simply have lower levels of the construct being measured — which is not DIF. DIF requires showing that specific items perform differently for different groups at the *same* ability level. You can have large overall group differences with no DIF on any individual item (if the group difference reflects the construct uniformly), or you can have items with DIF even when overall group means are identical.
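The second case, DIF on individual items with identical overall means, can be sketched as follows. The setup is an assumption chosen for illustration: both groups share one symmetric ability distribution, and a two-item test has one item that is harder for the focal group and one that is easier by the same amount. The group labels, difficulties, and ability values are all hypothetical.

```python
import math

def p_correct(theta, difficulty):
    """Logistic item response function: P(correct | ability theta)."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

# Same symmetric ability distribution for both groups (hypothetical).
abilities = {-1.0: 0.25, 0.0: 0.5, 1.0: 0.25}

# Group-specific difficulties: item1 is harder for the focal group,
# item2 is easier by the same amount (assumed for illustration).
difficulties = {
    "reference": {"item1": 0.0, "item2": 0.0},
    "focal": {"item1": 0.5, "item2": -0.5},
}

def expected_total(group):
    """Expected total score on the two-item test for one group."""
    return sum(
        share * sum(p_correct(theta, b) for b in difficulties[group].values())
        for theta, share in abilities.items()
    )

def item_gap(item, theta):
    """Reference minus focal P(correct) on one item at matched ability."""
    return (p_correct(theta, difficulties["reference"][item])
            - p_correct(theta, difficulties["focal"][item]))

print("expected totals:", expected_total("reference"), expected_total("focal"))
print("item1 gap at theta=0:", item_gap("item1", 0.0))
print("item2 gap at theta=0:", item_gap("item2", 0.0))
```

In this contrived setup the two opposing item effects cancel in the total score, so the group means match even though each item behaves differently for examinees of equal ability. Exact cancellation depends on the symmetry assumed here; in real data the effects would only partly offset.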
Question 4 True / False
The Mantel-Haenszel method detects DIF by stratifying examinees into ability-matched subgroups and testing whether each item's difficulty is consistent across demographic groups within each stratum.
T. True
F. False
Answer: True
This is an accurate description of the Mantel-Haenszel approach. By creating subgroups of examinees matched on total test score (as a proxy for ability), the method controls for ability before comparing item performance across demographic groups. Within each stratum it compares the odds of a correct response for the two groups, then pools the strata into a common odds ratio; a value of 1 indicates no DIF. This non-parametric approach does not require fitting an IRT model and remains widely used because it is computationally straightforward and interpretable. It directly implements the 'conditioning on ability' logic that defines DIF detection.
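As a rough sketch of the computation, the Mantel-Haenszel common odds ratio can be calculated from one 2x2 table per ability stratum. The counts below are invented for illustration, and the function names are hypothetical. The sketch covers only the effect-size estimate, not the accompanying chi-square significance test. The rescaling to the ETS delta metric uses the conventional factor of -2.35 times the natural log of the odds ratio.

```python
import math

def mh_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio.

    Each stratum is a tuple of counts:
    (ref_correct, ref_incorrect, focal_correct, focal_incorrect).
    """
    numerator = 0.0
    denominator = 0.0
    for ref_c, ref_i, foc_c, foc_i in strata:
        n = ref_c + ref_i + foc_c + foc_i
        numerator += ref_c * foc_i / n
        denominator += ref_i * foc_c / n
    return numerator / denominator

def mh_d_dif(alpha):
    """ETS delta-scale effect size; negative values disfavor the focal group."""
    return -2.35 * math.log(alpha)

# Hypothetical counts for a low-score and a high-score stratum.
# Within each stratum the odds of success are equal across groups.
no_dif = [(20, 30, 40, 60), (60, 20, 30, 10)]
# Within each stratum the reference group's odds are twice the focal group's.
with_dif = [(20, 30, 25, 75), (60, 20, 24, 16)]

alpha_no_dif = mh_odds_ratio(no_dif)
alpha_with_dif = mh_odds_ratio(with_dif)

print("no DIF:", alpha_no_dif, mh_d_dif(alpha_no_dif))
print("with DIF:", alpha_with_dif, mh_d_dif(alpha_with_dif))
```

In the `no_dif` data the reference group has the higher raw proportion correct (80 of 130 versus 70 of 140), yet the pooled odds ratio is 1 because the groups perform identically within each stratum. In the `with_dif` data the odds ratio is 2, which signals that the item favors the reference group even after matching.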
Question 5 Short Answer
Why is 'conditioning on ability' the essential step in DIF detection, and what does the analysis fail to show without it?
Model answer: DIF exists when an item functions differently for examinees at matched ability levels, that is, when the same item behaves differently for examinees who are otherwise equivalent on the target construct. Without conditioning on ability, a group performance difference on an item cannot be distinguished from a genuine group difference in the trait being measured: a raw comparison cannot tell you whether the item is biased or whether the groups simply differ in the construct. Conditioning on ability isolates the item-specific effect from the construct-level effect.
The intuition is that DIF asks a counterfactual: 'If I could compare two examinees with identical ability but from different demographic groups, would this item treat them identically?' That counterfactual requires ability-matching. Without it, you cannot answer the DIF question at all — you can only observe that groups differ, which is a different (and much less interesting) finding.