Questions: IRT Model Comparison and Fit Evaluation
5 questions to test your understanding
Question 1 Multiple Choice
A psychometrician tests a 50-item certification exam. The 2PL fits significantly better than the Rasch model by likelihood ratio test (p < .001), but item discrimination parameters vary narrowly (range: 0.85–1.15). The test will be used for large-scale adaptive testing across multiple years and examinee populations. The most defensible model choice is:
A. Always the 2PL: a statistically significant fit difference must be respected
B. The 3PL: if the 2PL fits better than Rasch, the 3PL likely fits even better and should be explored
C. The Rasch model: the fit improvement is trivially small, and Rasch's sample-independent calibration property is valuable for adaptive testing and equating across populations
D. Neither: the narrow discrimination range means the items are too similar and should be revised before model selection
Answer: C
This question illustrates the core principle that model selection in IRT is a judgment, not a mechanical statistical algorithm. With large samples, even trivial improvements in fit can be statistically significant. When item discriminations vary only modestly (0.85–1.15 is close to the single common value, conventionally 1.0, that the Rasch model assumes), the practical gain from the 2PL is minimal, while Rasch's distinctive measurement property, sample-independent item calibration, is highly valuable for adaptive testing and test equating. Option A treats statistical significance as equivalent to practical importance; option B adds parameters with no evidence that they are needed; option D treats near-equal discriminations as a defect when they are what the Rasch model assumes.
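To make "close to Rasch's assumption" concrete, the sketch below compares 2PL item response curves at the two ends of the discrimination range in the question (0.85 and 1.15) with the Rasch curve. It assumes the Rasch discrimination is fixed at 1.0 and uses a hypothetical item of difficulty 0 on an ability grid from -3 to 3; the function names are illustrative, not taken from any IRT package.

```python
import math

def irf_2pl(theta, a, b):
    """Probability of a correct response under the 2PL; a = 1 gives the Rasch curve."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def max_gap_from_rasch(a, b=0.0, lo=-3.0, hi=3.0, steps=61):
    """Largest |P_2PL - P_Rasch| over an evenly spaced ability grid (assumed range)."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(abs(irf_2pl(t, a, b) - irf_2pl(t, 1.0, b)) for t in grid)

# Endpoints of the discrimination range quoted in the question.
for a in (0.85, 1.15):
    print(f"a = {a:.2f}: max gap from Rasch curve = {max_gap_from_rasch(a):.3f}")
```

On this grid the predicted probabilities never differ from the Rasch curve by as much as 0.05, which is the sense in which the extra discrimination parameters buy little here.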
Question 2 Multiple Choice
Why are information criteria like AIC and BIC often preferred over the likelihood ratio test alone for comparing IRT models in large psychometric samples?
A. Because AIC and BIC can compare non-nested models, whereas the LRT is restricted to nested model families
B. Because in large samples the LRT almost always rejects the simpler model regardless of practical significance, while AIC and BIC penalize complexity and measure whether added parameters earn their keep
C. Because the LRT requires normality assumptions that are violated in IRT data
D. Because AIC is always lower for more complex models, making it a reliable guide to model selection
Answer: B
The fundamental problem with using the LRT alone in large psychometric samples is that with thousands of examinees, even trivially small differences in fit produce significant chi-square values. AIC and BIC add explicit complexity penalties to the model deviance (AIC: 2k; BIC: k·ln(n), where k is the number of estimated parameters and n is the sample size), asking not just 'is the complex model better?' but 'is the improvement worth the extra parameters?' BIC's heavier penalty makes it especially conservative in large samples. Option A is a true statement (AIC and BIC can compare non-nested models) but not the primary reason for their use here, since the models being compared are nested. Option C is wrong because the LRT relies on an asymptotic chi-square reference distribution, not on normally distributed responses. Option D is wrong because lower AIC favors the model that best balances fit and parsimony, not simply the most complex one.
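The penalties described above can be sketched in a few lines. The log-likelihoods, sample size, and parameter counts below are made-up values for illustration only (a 50-item test with 50 Rasch parameters and 100 2PL parameters, a simplification that ignores any ability-distribution parameters); they are not results from a real calibration.

```python
import math

def aic(log_lik, k):
    """AIC = -2 * log-likelihood + 2k."""
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik, k, n):
    """BIC = -2 * log-likelihood + k * ln(n)."""
    return -2.0 * log_lik + k * math.log(n)

# Hypothetical fit results, assumed for illustration.
n = 5000
rasch = {"log_lik": -150000.0, "k": 50}
two_pl = {"log_lik": -149900.0, "k": 100}

# Likelihood ratio statistic and its degrees of freedom (nested models).
lrt_stat = 2.0 * (two_pl["log_lik"] - rasch["log_lik"])
lrt_df = two_pl["k"] - rasch["k"]
print(f"LRT statistic = {lrt_stat:.1f} on {lrt_df} df")

for name, m in (("Rasch", rasch), ("2PL", two_pl)):
    print(f"{name}: AIC = {aic(m['log_lik'], m['k']):.1f}, "
          f"BIC = {bic(m['log_lik'], m['k'], n):.1f}")
```

With these invented numbers the LRT statistic is 200 on 50 degrees of freedom, AIC prefers the 2PL (300000 versus 300100), and BIC prefers the Rasch model, which shows how BIC's k·ln(n) penalty can reverse the verdict in a large sample.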
Question 3 True / False
A model can show acceptable global fit statistics while individual items within it misfit the model's predictions badly.
T. True
F. False
Answer: True
Global fit statistics (such as the LRT, AIC, and BIC) summarize fit across all items and examinees at once. A model can achieve good aggregate fit while specific items have response patterns that deviate substantially from the model's predicted item response functions. Item-level infit and outfit statistics are essential diagnostics precisely because global fit can mask misfit in individual items. A test with five badly misfitting items among fifty is not trustworthy even if global statistics look acceptable.
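As a rough sketch of what the item-level diagnostics compute, the code below evaluates infit and outfit mean squares for a single item under the Rasch model. It treats person abilities and the item difficulty as known, whereas in practice both are estimated; the function names and the five-person data are hypothetical.

```python
import math

def rasch_prob(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_fit(thetas, responses, b):
    """Return (infit, outfit) mean squares for one item.

    outfit: unweighted mean of squared standardized residuals.
    infit:  squared residuals summed, divided by the summed model variances.
    """
    probs = [rasch_prob(t, b) for t in thetas]
    variances = [p * (1.0 - p) for p in probs]
    sq_resid = [(x - p) ** 2 for x, p in zip(responses, probs)]
    outfit = sum(r / w for r, w in zip(sq_resid, variances)) / len(thetas)
    infit = sum(sq_resid) / sum(variances)
    return infit, outfit

# Hypothetical abilities and two response patterns to an item with b = 0.
thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]
consistent = [0, 0, 1, 1, 1]   # low-ability persons fail, high-ability persons pass
reversed_ = [1, 1, 0, 0, 0]    # the opposite of what the model predicts

print("consistent pattern:", item_fit(thetas, consistent, b=0.0))
print("reversed pattern:  ", item_fit(thetas, reversed_, b=0.0))
```

Values near 1 indicate responses about as noisy as the model expects. The reversed pattern yields mean squares well above 1, the kind of item-level signal that an aggregate statistic over fifty items can dilute.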
Question 4 True / False
When a likelihood ratio test shows the 3PL fits significantly better than the Rasch model, the 3PL should generally be selected for the final test.
T. True
F. False
Answer: False
Statistical significance of the likelihood ratio test is necessary but not sufficient for model selection in IRT. The psychometrician must also weigh the practical utility of each model, the size of the fit improvement relative to the added parameters (via AIC/BIC), the stability of parameter estimates (the 3PL's lower-asymptote parameters are often hard to estimate stably), item-level fit, and the intended use of the test. The Rasch model's sample-independent calibration property may be worth the marginal cost in fit, and that is a judgment no statistical test can make automatically.
Question 5 Short Answer
What is the unique measurement property of the Rasch model that makes it especially valuable for large-scale or adaptive testing, and under what conditions might this property justify choosing Rasch over a 2PL that fits the data better?
Model answer: The Rasch model's unique property is sample-independent item calibration: when Rasch assumptions hold, person ability and item difficulty are on the same scale, and item parameters estimated from one sample apply to a different sample without re-estimation. This makes Rasch ideal for adaptive testing (items from a calibrated bank can be administered to any examinee), test equating across years, and measurement across different populations. A psychometrician might choose Rasch over a better-fitting 2PL when item discriminations vary only modestly, the fit improvement is practically small, and the test requires the measurement stability that only Rasch provides.
The key distinction is that model selection involves a tradeoff between empirical fit and measurement utility. A model with slightly worse fit but superior theoretical properties for the test's purpose can be the right choice. This is what makes IRT model comparison a professional judgment, not a statistical procedure.