Questions: Test Bias Detection Methods and Statistical Approaches
5 questions to test your understanding
Question 1 Multiple Choice
A test developer applies the Mantel-Haenszel procedure to an item and finds no significant DIF. A measurement colleague argues the item could still be biased. Under what condition would the colleague be correct?
A. If the item has low test-retest reliability, MH produces inflated false-negative rates
B. If the DIF effect reverses direction across ability levels (non-uniform DIF), MH would not detect it
C. The colleague is wrong; a non-significant MH result establishes that the item is unbiased
D. MH cannot be trusted for items in the middle difficulty range
Answer: B
The Mantel-Haenszel procedure assumes uniform DIF: an effect with the same direction and magnitude at every level of the matching variable (usually total score). It pools the score strata into a single common odds ratio. If the item actually shows non-uniform DIF (favoring one group at low ability but the other at high ability), the opposing effects can cancel in that pooled estimate, and MH will report little or no DIF. Logistic regression with an interaction term between group and total score can detect non-uniform DIF. The distinction matters in practice because non-uniform DIF cannot be corrected by a single aggregate adjustment.
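The cancellation described above can be illustrated with a minimal sketch. The counts below are made up and chosen so the two strata offset each other exactly; the names `Stratum`, `stratum_odds_ratio`, and `mh_common_odds_ratio` are hypothetical helpers, not part of any library. Real analyses would use many score strata and a significance test as well.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stratum:
    """2x2 counts for one total-score stratum (hypothetical data)."""
    ref_correct: int
    ref_incorrect: int
    focal_correct: int
    focal_incorrect: int

    @property
    def n(self) -> int:
        # Total number of examinees in this stratum.
        return (self.ref_correct + self.ref_incorrect
                + self.focal_correct + self.focal_incorrect)


def stratum_odds_ratio(s: Stratum) -> float:
    """Odds of a correct answer, reference vs. focal, within one stratum."""
    return (s.ref_correct * s.focal_incorrect) / (s.ref_incorrect * s.focal_correct)


def mh_common_odds_ratio(strata: List[Stratum]) -> float:
    """Mantel-Haenszel pooled odds ratio across all strata."""
    num = sum(s.ref_correct * s.focal_incorrect / s.n for s in strata)
    den = sum(s.ref_incorrect * s.focal_correct / s.n for s in strata)
    return num / den


# Low scorers: 30% of the reference group vs. 20% of the focal group answer correctly.
low = Stratum(ref_correct=30, ref_incorrect=70, focal_correct=20, focal_incorrect=80)
# High scorers: the advantage flips (70% reference vs. 80% focal).
high = Stratum(ref_correct=70, ref_incorrect=30, focal_correct=80, focal_incorrect=20)

print(stratum_odds_ratio(low))             # above 1: favors the reference group
print(stratum_odds_ratio(high))            # below 1: favors the focal group
print(mh_common_odds_ratio([low, high]))   # 1.0: the pooled estimate shows no DIF
```

Each stratum on its own shows a clear group difference, yet the pooled odds ratio is exactly 1, which is why a non-significant MH result does not rule out crossing DIF.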
Question 2 Multiple Choice
A research team wants to compare latent mean scores on a depression scale between British and Korean samples to determine whether one population is more depressed on average. What statistical requirement must be met for this comparison to be valid?
A. The scale must achieve Cronbach's alpha ≥ 0.80 in both samples
B. The samples must be matched on age, gender, and education
C. Scalar measurement invariance must hold: the same factor loadings and item intercepts across groups
D. No individual item should show significant DIF in either sample
Answer: C
Latent mean comparisons require scalar invariance: not only must the items load on the same factors with the same magnitudes (metric invariance), but the items' intercepts — their baseline response tendencies — must be the same across groups. If intercepts differ, group members at the same latent level of depression would respond differently to items, making the scales non-comparable. Scalar invariance is the specific, often-violated condition that licenses cross-group latent mean comparison. High reliability does not ensure invariance; matching on demographics does not replace measurement equivalence testing.
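A small numeric sketch may make the role of intercepts concrete. It assumes a linear one-factor model in which the expected response to an item is intercept + loading × latent mean; all numbers and the helper `expected_item_means` are hypothetical and only illustrate the logic.

```python
def expected_item_means(intercepts, loadings, latent_mean):
    """Expected item responses under a linear one-factor model."""
    return [tau + lam * latent_mean for tau, lam in zip(intercepts, loadings)]


# Equal loadings in both groups (metric invariance holds in this toy example).
loadings = [0.8, 0.7, 0.6]

# Group B's third item has a higher intercept (scalar invariance is violated).
intercepts_a = [2.0, 2.5, 1.5]
intercepts_b = [2.0, 2.5, 2.0]

# Both groups are given the SAME latent mean.
latent_mean = 1.0

sum_a = sum(expected_item_means(intercepts_a, loadings, latent_mean))
sum_b = sum(expected_item_means(intercepts_b, loadings, latent_mean))

# The expected scale scores differ even though the latent means are identical.
print(sum_a, sum_b, sum_b - sum_a)
```

The gap between the two expected scale scores comes entirely from the intercept difference, so reading it as a difference in depression would be a mistake.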
Question 3 True / False
Non-uniform DIF is more problematic than uniform DIF because it cannot be corrected by simply adjusting total scores — the group difference changes direction or magnitude across the ability distribution.
T. True
F. False
Answer: True
Uniform DIF gives one group a consistent advantage at all ability levels. It is unfair, but it creates a predictable, constant offset that might be addressable through item removal or score adjustment. Non-uniform DIF is more insidious: the advantage switches direction (or varies substantially) across ability levels, so no single correction equalizes group performance. The item relates to the underlying trait differently in each group, which undermines the validity of the test for all score comparisons between groups.
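The contrast can be sketched with the logistic regression form often used for DIF, assuming log-odds of a correct answer equal to b0 + b1·score + b2·group + b3·score·group. The coefficient values and function names below are hypothetical, chosen only to show a constant gap versus a gap that crosses zero.

```python
import math


def p_correct(score, group, b0, b1, b2, b3):
    """Probability of a correct answer under a logistic DIF model."""
    z = b0 + b1 * score + b2 * group + b3 * score * group
    return 1.0 / (1.0 + math.exp(-z))


def log_odds_gap(score, coefs):
    """Log-odds of success for group 1 minus group 0 at a given total score."""
    def logit(p):
        return math.log(p / (1.0 - p))
    return logit(p_correct(score, 1, *coefs)) - logit(p_correct(score, 0, *coefs))


scores = range(0, 21, 5)

# Uniform DIF: group effect only (b3 = 0), so the gap is the same at every score.
uniform = (-2.0, 0.2, -0.5, 0.0)
# Non-uniform DIF: group-by-score interaction (b3 != 0), so the gap changes sign.
nonuniform = (-2.0, 0.2, -1.0, 0.1)

uniform_gaps = [log_odds_gap(s, uniform) for s in scores]
nonuniform_gaps = [log_odds_gap(s, nonuniform) for s in scores]

print(uniform_gaps)     # constant offset across the score range
print(nonuniform_gaps)  # negative at low scores, positive at high scores
```

With a constant gap, one adjustment applies everywhere; with a crossing gap, any fixed adjustment helps at one end of the scale and hurts at the other.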
Question 4 True / False
Establishing that a scale has the same factor structure (configural invariance) in two groups is sufficient to support valid comparisons of latent means across those groups.
T. True
F. False
Answer: False
Configural invariance only establishes that the same items load on the same factors in both groups — it says nothing about whether the loadings or intercepts are numerically equal. Metric invariance (equal loadings) is required before comparing relationships between latent variables. Scalar invariance (equal loadings AND equal intercepts) is required before comparing latent means. Each level of invariance is a stronger constraint; only scalar invariance licenses the specific claim that a given latent score represents the same standing across groups.
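The nesting of the three levels can be written compactly. The notation here is an assumption (a standard multi-group factor model with intercepts τ, loadings Λ, latent variables η with means κ, and residuals ε for groups g = 1, 2), not something the quiz itself defines.

```latex
\begin{align*}
\text{Model for group } g:\quad
  & \mathbf{x}_g = \boldsymbol{\tau}_g + \boldsymbol{\Lambda}_g \boldsymbol{\eta}_g + \boldsymbol{\varepsilon}_g,
  \qquad E[\boldsymbol{\eta}_g] = \boldsymbol{\kappa}_g \\
\text{Configural:}\quad
  & \boldsymbol{\Lambda}_1 \text{ and } \boldsymbol{\Lambda}_2 \text{ share the same pattern of zero and free loadings} \\
\text{Metric:}\quad
  & \boldsymbol{\Lambda}_1 = \boldsymbol{\Lambda}_2 = \boldsymbol{\Lambda} \\
\text{Scalar:}\quad
  & \boldsymbol{\Lambda}_1 = \boldsymbol{\Lambda}_2 = \boldsymbol{\Lambda}
    \ \text{and}\ \boldsymbol{\tau}_1 = \boldsymbol{\tau}_2 \\
\text{Under scalar invariance:}\quad
  & E[\mathbf{x}_1] - E[\mathbf{x}_2] = \boldsymbol{\Lambda}\,(\boldsymbol{\kappa}_1 - \boldsymbol{\kappa}_2)
\end{align*}
```

Only in the last line do differences in observed item means trace back solely to differences in latent means, which is the comparison the question asks about.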
Question 5 Short Answer
Why is test bias detection considered a form of validity evidence collection, rather than a separate psychometric concern?
Model answer: Validity is the degree to which a test measures what it is intended to measure. If a test item or scale measures one construct for one group but a slightly different construct (or the same construct plus group-related noise) for another, the test is not valid for cross-group comparisons — regardless of its internal consistency. Detecting DIF and invariance violations is directly testing whether the measurement model holds across groups, which is a core validity question. Bias is a specific type of construct-irrelevant variance, and bias detection is the empirical process of identifying it.
The connection to validity is not just definitional — it has practical consequences. A test reported to be 'reliable and valid' in a general sense may still be invalid for specific comparisons (e.g., group mean differences) if measurement invariance has not been tested. Bias detection methods operationalize the validity inquiry: they turn the abstract question 'does this test mean the same thing for everyone?' into testable statistical hypotheses.