IRT Model Comparison and Fit Evaluation

Research Depth 79 in the knowledge graph
model-selection goodness-of-fit likelihood-ratio aic-bic

Core Idea

Comparing IRT models requires examining fit statistics (likelihood ratio tests, AIC, BIC), item-level residuals, and practical utility. Model selection balances parsimony with empirical fit. A simpler model (Rasch) may be preferred even if more complex models (2PL, 3PL) fit better, depending on measurement goals and resources.

Explainer

You have now studied three IRT models: the Rasch (1PL) model, the 2PL, and the 3PL. Each successive model adds one parameter per item to capture more item-level variation: the 2PL adds discrimination (how sharply the item separates low-ability from high-ability examinees), and the 3PL adds a pseudo-guessing parameter (the lower asymptote, that is, the probability that a very low-ability examinee answers correctly by chance). The natural question is: which model should you use? The answer requires balancing two competing pressures that should already be familiar from your study of probability and statistical inference: fit and parsimony.
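The relationship among the three models can be made concrete with a single item response function. The following is a minimal sketch, assuming the standard logistic form without the optional 1.7 scaling constant; the function name `irf_3pl`, its argument names, and the parameter values are illustrative and not taken from any particular library or data set.

```python
import math

def irf_3pl(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct response under the 3PL model.

    theta: examinee ability; a: discrimination; b: difficulty;
    c: pseudo-guessing (lower asymptote).
    Setting c=0 gives the 2PL; setting c=0 and a=1 gives the Rasch model.
    """
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic

# Same item difficulty under three increasingly flexible models (made-up values).
for theta in (-3.0, 0.0, 3.0):
    rasch = irf_3pl(theta, b=0.0)
    two_pl = irf_3pl(theta, a=2.0, b=0.0)
    three_pl = irf_3pl(theta, a=2.0, b=0.0, c=0.2)
    print(f"theta={theta:+.1f}  Rasch={rasch:.3f}  2PL={two_pl:.3f}  3PL={three_pl:.3f}")
```

At theta = 0 the Rasch and 2PL curves both pass through 0.5, while the 3PL curve passes through 0.6 because of the 0.2 lower asymptote.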

The most direct statistical tool for comparing nested IRT models is the likelihood ratio test (LRT). Because the Rasch model is a constrained version of the 2PL (with all discriminations fixed to 1), and the 2PL is a constrained version of the 3PL (with all guessing parameters fixed to 0), these models are nested. The LRT compares the maximized log-likelihoods of two models. The test statistic is −2 times the difference between the simpler model's log-likelihood and the more complex model's; under the simpler model it is approximately chi-square distributed, with degrees of freedom equal to the difference in the number of estimated parameters. If the statistic is significant, you have evidence that the additional parameters are justified. When you studied the chi-square test, you encountered this same logic: a significant result means the simpler model's constraints are inconsistent with the data. One caveat applies to the 2PL versus 3PL comparison: a guessing parameter of 0 lies on the boundary of its allowed range, so the chi-square reference distribution is only a rough guide there.
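As a rough illustration of the likelihood ratio test, the sketch below computes the statistic, its degrees of freedom, and an approximate p-value using only the Python standard library. The log-likelihoods and parameter counts are hypothetical (a 20-item test, so 20 difficulties for Rasch and 20 more discriminations for the 2PL), and `chi2_sf` is a small helper written for this example; in practice a statistics library would normally supply the chi-square tail probability.

```python
import math

def chi2_sf(x, df):
    """Upper-tail chi-square probability via the incomplete-gamma power series.

    Intended for illustration with moderate statistic values only.
    """
    if x <= 0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    term = 1.0 / a          # n = 0 term of the series
    total = term
    n = 1
    while abs(term) > 1e-16 * abs(total) and n < 100000:
        term *= half_x / (a + n)
        total += term
        n += 1
    log_prefactor = a * math.log(half_x) - half_x - math.lgamma(a)
    p_lower = total * math.exp(log_prefactor)
    return min(1.0, max(0.0, 1.0 - p_lower))

def likelihood_ratio_test(loglik_simple, loglik_complex, k_simple, k_complex):
    """Return (statistic, df, p_value) for two nested models."""
    statistic = -2.0 * (loglik_simple - loglik_complex)
    df = k_complex - k_simple
    return statistic, df, chi2_sf(statistic, df)

# Hypothetical fitted values: Rasch (20 parameters) vs 2PL (40 parameters).
stat, df, p = likelihood_ratio_test(-11250.0, -11210.0, 20, 40)
print(f"LRT statistic = {stat:.1f}, df = {df}, p = {p:.2e}")
```

With these made-up numbers the statistic is 80 on 20 degrees of freedom, which is far beyond conventional significance thresholds, so the equal-discrimination constraint would be rejected.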

However, statistical significance alone is not sufficient for model selection. With large samples, which are common in psychometric applications, even trivially small improvements in fit can be statistically significant. This is where information criteria become essential. The AIC (Akaike Information Criterion) penalizes model complexity as 2k − 2ln(L), where k is the number of estimated parameters and L is the maximized likelihood. The BIC (Bayesian Information Criterion) is k·ln(n) − 2ln(L), where n is the sample size. Because ln(n) exceeds 2 once n is 8 or more, BIC penalizes each parameter more heavily than AIC in any realistic sample, making it more conservative against overfitting. Lower values are better for both. When a 3PL model has lower AIC than the Rasch model, the gain in fit outweighs the cost of the additional parameters by the AIC's metric; the model comparison is essentially asking whether the extra parameters are "earning their keep."
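A short sketch can show how the two criteria may disagree. It uses the standard definitions AIC = 2k − 2ln(L) and BIC = k·ln(n) − 2ln(L); the log-likelihoods, parameter counts, and sample size are hypothetical values chosen for illustration, not results from real data.

```python
import math

def aic(loglik, k):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2.0 * k - 2.0 * loglik

def bic(loglik, k, n):
    """Bayesian Information Criterion: k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2.0 * loglik

# Hypothetical fits to a 20-item test taken by 1000 examinees.
n_examinees = 1000
models = {
    "Rasch": {"loglik": -11250.0, "k": 20},
    "2PL": {"loglik": -11210.0, "k": 40},
}

for name, fit in models.items():
    a = aic(fit["loglik"], fit["k"])
    b = bic(fit["loglik"], fit["k"], n_examinees)
    print(f"{name:5s}  AIC = {a:.1f}  BIC = {b:.1f}")

best_by_aic = min(models, key=lambda m: aic(models[m]["loglik"], models[m]["k"]))
best_by_bic = min(models, key=lambda m: bic(models[m]["loglik"], models[m]["k"], n_examinees))
print("Preferred by AIC:", best_by_aic)
print("Preferred by BIC:", best_by_bic)
```

In this made-up case AIC prefers the 2PL while BIC prefers the Rasch model, because BIC charges ln(1000), roughly 6.9, per extra parameter instead of 2.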

Beyond global fit, item-level residuals are equally important. A model can fit overall while specific items misfit badly, meaning that the observed item response functions do not match the model's predicted curves. Infit and outfit statistics are mean-square summaries of the residuals that flag items whose observed response patterns diverge from the model's expectations. Outfit weights all examinees equally, so it is sensitive to unexpected responses from examinees whose ability is far from the item's difficulty; infit is information-weighted, so it is dominated by examinees near the item's difficulty level. A model that fits globally but has many misfitting items is not trustworthy for interpreting scores that depend on those items.
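The difference between the two statistics is easier to see in a small computation. This is a minimal sketch for a single Rasch item, assuming the common mean-square definitions (outfit as the average squared standardized residual, infit as the sum of squared residuals divided by the sum of model variances). The abilities, difficulty, and responses are invented, and abilities are treated as known, whereas in practice they would be estimates.

```python
import math

def rasch_prob(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_fit(responses, thetas, b):
    """Return (infit, outfit) mean squares for one item."""
    sq_resid, variances = [], []
    for x, theta in zip(responses, thetas):
        p = rasch_prob(theta, b)
        sq_resid.append((x - p) ** 2)      # squared raw residual
        variances.append(p * (1.0 - p))    # model variance of the response
    # Outfit: unweighted mean of squared standardized residuals.
    outfit = sum(r / v for r, v in zip(sq_resid, variances)) / len(sq_resid)
    # Infit: residuals weighted by information (the model variance).
    infit = sum(sq_resid) / sum(variances)
    return infit, outfit

thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]   # hypothetical abilities
difficulty = 0.0                       # hypothetical item difficulty

consistent = [0, 0, 1, 1, 1]   # responses in line with the model
aberrant = [1, 0, 1, 1, 1]     # lowest-ability examinee unexpectedly correct

infit_ok, outfit_ok = item_fit(consistent, thetas, difficulty)
infit_bad, outfit_bad = item_fit(aberrant, thetas, difficulty)
print(f"consistent: infit = {infit_ok:.2f}, outfit = {outfit_ok:.2f}")
print(f"aberrant:   infit = {infit_bad:.2f}, outfit = {outfit_bad:.2f}")
```

Values near 1 indicate about as much noise as the model expects. The single surprising response from the examinee at theta = −2 inflates outfit more than infit, which is the outlier sensitivity described above.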

The final and often decisive factor is practical utility. The Rasch model has a distinctive property: the raw sum score is a sufficient statistic for ability, so when the model's assumptions hold, comparisons between persons do not depend on which items were administered, and comparisons between items do not depend on which examinees answered them. This supports sample-independent item calibration: items calibrated on one sample can be used to measure a different sample without re-estimation. This property makes Rasch models especially valuable for large-scale testing programs, adaptive testing, and test equating. If the 2PL fits slightly better than Rasch by AIC but item discriminations vary only modestly, a psychometrician might prefer Rasch for its measurement properties rather than the marginal fit gain. Model selection in IRT is not a statistical algorithm; it is a judgment that weighs empirical evidence, theoretical commitments, and the uses to which the test will be put.

Prerequisite Chain

Counting to 10 → Counting to 20 → Understanding Zero → The Number Zero → Counting to Five → One-to-One Correspondence → Combining Small Groups Within 5 → Addition Within 10 → Addition Within 20 → Two-Digit Addition Without Regrouping → Two-Digit Addition with Regrouping → Addition Within 100 → Repeated Addition as Multiplication → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Two-Digit by One-Digit Division → Division with Remainders → Remainders and Quotients in Division → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Reading and Writing Decimals → Comparing and Ordering Decimals → Adding and Subtracting Decimals → Multiplying Decimals → Dividing Decimals → Dividing Fractions → Mixed Number Arithmetic → Order of Operations → Integer Order of Operations → Variable Expressions → Combining Like Terms → One-Step Equations → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Trigonometric Ratios Review → Radian Measure → Converting Between Degrees and Radians → The Unit Circle → Graphing Sine and Cosine → Graphing Tangent and Reciprocal Trigonometric Functions → Derivatives of Trigonometric Functions → Antiderivatives → Indefinite Integrals → Basic Integration Rules → Riemann Sums → Definite Integral Definition → Probability Density Functions and Continuous Distributions → Cumulative Distribution Functions → Continuous Random Variables → Normal Distribution → Classical Test Theory Foundations → Item Response Functions and Item Characteristic Curves → Rasch Model: One-Parameter Item Response Theory → Two-Parameter Logistic IRT Model (2PL) → Three-Parameter Logistic IRT Model (3PL) → IRT Model Comparison and Fit Evaluation

Longest path: 80 steps · 400 total prerequisite topics

Prerequisites (5)

Leads To (0)

No topics depend on this one yet.