Questions: Classical and IRT-Based Item Analysis Compared
5 questions to test your understanding
Question 1 Multiple Choice
A test item has a p-value of 0.85 when administered to a sample of college graduates. The same item has a p-value of 0.52 when given to a sample of high school students. The best interpretation is:
A. The item was scored incorrectly for one of the groups
B. This demonstrates that p-value is a sample-dependent statistic, not a fixed property of the item
C. The item discriminates poorly because its difficulty appears to change across groups
D. The high school students received a flawed test administration
Answer: B
The p-value (proportion answering correctly) reflects both item properties and the ability distribution of the sample. A more able group will, in expectation, produce a higher p-value on the same item, not because the item changed but because CTT statistics conflate item and sample. This is the central limitation of classical test theory: from the p-value alone you cannot tell whether a difference between groups reflects the item or the examinees. IRT addresses this by estimating item parameters that are theoretically invariant across populations.
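The sample-dependence described above can be illustrated with a minimal sketch. It assumes a two-parameter logistic (2PL) item and normally distributed ability in each group; the item parameters and group means are hypothetical values chosen for illustration, not figures from the question. The expected p-value is obtained by averaging the item's response probability over each group's ability distribution.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def normal_pdf(x, mean, sd):
    """Density of a Normal(mean, sd) distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def expected_p_value(a, b, mean, sd, n_points=2001, width=6.0):
    """Expected proportion correct in a group with Normal(mean, sd) ability.

    Integrates P(theta) * density(theta) with the trapezoid rule
    over mean +/- width standard deviations.
    """
    lo = mean - width * sd
    hi = mean + width * sd
    step = (hi - lo) / (n_points - 1)
    total = 0.0
    for k in range(n_points):
        theta = lo + k * step
        weight = 0.5 if k in (0, n_points - 1) else 1.0
        total += weight * p_correct(theta, a, b) * normal_pdf(theta, mean, sd)
    return total * step

# Hypothetical item: the same (a, b) is used for both groups.
ITEM_A, ITEM_B = 1.2, 0.0

# Hypothetical groups that differ only in mean ability.
p_low = expected_p_value(ITEM_A, ITEM_B, mean=0.0, sd=1.0)
p_high = expected_p_value(ITEM_A, ITEM_B, mean=1.5, sd=1.0)

print(f"lower-ability group p-value:  {p_low:.3f}")
print(f"higher-ability group p-value: {p_high:.3f}")
```

The item parameters never change between the two calls; only the ability distribution does, yet the two p-values differ.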
Question 2 Multiple Choice
A testing company needs to build an item bank and compare scores across different test forms administered to different cohorts each year. Which measurement approach is most appropriate?
A. Classical test theory, because p-values and point-biserials are simpler to compute and interpret
B. IRT, because item parameter estimates are theoretically invariant across populations, enabling score equating across different forms and cohorts
C. Classical test theory, because point-biserial correlations capture the same information as IRT discrimination parameters
D. Either approach works equally well for equating scores across test forms
Answer: B
Score equating, which places scores from different test forms on a common scale, requires item properties that hold steady across administrations. CTT item statistics are sample-dependent, so a p-value from one cohort cannot directly inform interpretation in another. IRT item parameters (difficulty b, discrimination a) are theoretically invariant across populations up to a linear transformation of the ability scale, so once forms are linked through common items the test developer can treat item properties as fixed. This is why many large-scale standardized testing programs rely on IRT for equating and item banking.
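As a rough illustration of how separately calibrated forms can be brought onto one scale, the sketch below uses mean-sigma linking, one common approach; the explanation above does not name a specific method, so this choice is an assumption. The anchor-item difficulties are hypothetical, and the function names are made up for this example.

```python
from statistics import mean, pstdev

def mean_sigma_constants(b_new, b_ref):
    """Linking constants (A, B) such that b_ref is approximately A * b_new + B.

    b_new and b_ref hold difficulty estimates for the same anchor items,
    taken from the new calibration and the reference calibration.
    """
    A = pstdev(b_ref) / pstdev(b_new)
    B = mean(b_ref) - A * mean(b_new)
    return A, B

def rescale_item(a, b, A, B):
    """Move 2PL parameters (a, b) from the new scale to the reference scale."""
    return a / A, A * b + B

# Hypothetical anchor items: the new cohort's calibration expresses the
# same difficulties on a shifted and stretched ability scale.
b_ref = [-1.0, -0.2, 0.5, 1.3]
b_new = [(b - 0.4) / 1.25 for b in b_ref]

A, B = mean_sigma_constants(b_new, b_ref)
a_star, b_star = rescale_item(1.0, b_new[0], A, B)

print(f"A = {A:.3f}, B = {B:.3f}")
print(f"first anchor item on reference scale: a = {a_star:.3f}, b = {b_star:.3f}")
```

With real data the anchor estimates contain error, so the recovered constants would only approximate the true scale relationship.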
Question 3 True / False
A CTT p-value of 0.80 indicates that the item has moderate difficulty, regardless of which population is tested.
T. True
F. False
Answer: False
A p-value of 0.80 means 80% of the tested sample answered correctly, and that number depends on the ability level of the sample as well as on the item. The same item might have p = 0.95 with a highly able group and p = 0.40 with a struggling group. CTT p-values describe the item-in-context, not the item alone. The statement would be approximately true only if you always test the same population, which is rarely the case. This sample-dependence is the fundamental limitation CTT cannot escape.
Question 4 True / False
IRT item parameter estimates allow test developers to place items from different test forms onto a common scale and compare their properties, even if those forms were administered to different groups.
T. True
F. False
Answer: True
This is the key practical advantage of IRT over CTT. Because IRT item parameters are theoretically invariant across populations (conditional on good model fit and a common scale), a difficulty parameter estimated from one cohort should apply to other cohorts. This parameter invariance makes score equating and item banking possible: a calibrated item difficulty can be used to predict performance in new groups without recalibrating the item on a full new sample. Large-scale adaptive testing systems depend on this property.
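To make the link to adaptive testing concrete, here is a minimal sketch of how a calibrated bank might be used to choose the next item. It assumes a 2PL model and a maximum-information selection rule, which is one common choice and not something the explanation above specifies. The bank contents and item names are hypothetical.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def most_informative(theta, bank):
    """Name of the bank item with the highest information at theta."""
    return max(bank, key=lambda name: item_information(theta, *bank[name]))

# Hypothetical calibrated bank: item name -> (discrimination a, difficulty b).
bank = {
    "item_easy": (1.0, -1.5),
    "item_mid": (1.0, 0.0),
    "item_hard": (1.0, 1.5),
}

for theta_hat in (-2.0, 0.1, 1.4):
    print(theta_hat, most_informative(theta_hat, bank))
```

The selection uses only stored parameters and the current ability estimate, so it is meaningful only to the extent that the calibrated parameters carry over to the examinee being tested.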
Question 5 Short Answer
What is the fundamental limitation of classical test theory item statistics, and how does IRT address it?
Think about your answer before reading the model answer below.
Model answer: CTT statistics (p-value, point-biserial) conflate item properties with sample properties — the same item appears easier or harder depending on who takes the test, making CTT statistics not portable across populations. IRT models the probability of a correct response as a function of both examinee ability and item-specific parameters estimated separately. Once calibrated, IRT item parameters are theoretically invariant across populations, separating what is in the item from what is in the examinees.
The sample-dependence of CTT is not merely a technical inconvenience: CTT statistics cannot be meaningfully compared across testing occasions unless comparable populations are tested each time. This makes CTT poorly suited to item banking, equating across non-equivalent groups, and computerized adaptive testing. IRT addresses this by locating items on the ability scale (theta) rather than describing them by the proportion correct in a particular sample, trading computational simplicity for measurement precision and portability across contexts.
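The contrast can be written compactly. The expressions below assume the two-parameter logistic (2PL) model as a representative IRT model; the quiz does not say which model is intended, and the sample ability density g(θ) is notation introduced here for illustration.

```latex
% 2PL item response function: item i has discrimination a_i and difficulty b_i
P_i(\theta) = \frac{1}{1 + \exp\left[-a_i(\theta - b_i)\right]}

% Expected CTT p-value of item i in a sample with ability density g(\theta)
p_i = \int P_i(\theta)\, g(\theta)\, d\theta
```

Under this sketch, a_i and b_i belong to the item alone, while p_i changes whenever g(θ) changes, which is the sample-dependence the model answer describes.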