Questions: Item Selection and Item Pool Development for Tests
5 questions to test your understanding
Question 1 Multiple Choice
A test developer selects the 20 items with the highest discrimination indices from a pilot pool of 60 items and declares the test complete. What is the most significant problem with this approach?
A. Twenty items is too few; reliability requires at least 40 items
B. Using only the top discriminating items may produce a test that covers only a narrow slice of the construct, violating content validity
C. Discrimination indices are not meaningful until the final test is assembled
D. High-discrimination items are typically too difficult for most test-takers
Answer: B
Item selection is constrained optimization, not simple maximization. Selecting purely on discrimination ignores the content blueprint: if the top 20 discriminating items all happen to assess the same sub-domain, the test has high internal consistency but poor content validity. It measures part of the construct very reliably and ignores the rest entirely. Good item selection maximizes reliability subject to the constraint that each specified content area is adequately represented. Option A reflects a common intuition but is not a universal rule; test length depends on reliability goals and content requirements, not on a fixed minimum.
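The contrast between pure maximization and blueprint-constrained selection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the item records, the keys `id`, `area`, and `disc`, the content areas, and the quotas are all hypothetical, and a real assembly procedure would also weigh difficulty and bias statistics.

```python
from collections import defaultdict

def select_items(pool, quotas):
    """Pick the most discriminating items within each content area's quota.

    pool   -- list of dicts with hypothetical keys 'id', 'area', 'disc'
    quotas -- dict mapping content area -> number of items required
    """
    by_area = defaultdict(list)
    for item in pool:
        by_area[item["area"]].append(item)

    selected = []
    for area, quota in quotas.items():
        # Rank candidates inside the area only, best discrimination first.
        candidates = sorted(by_area[area], key=lambda it: it["disc"], reverse=True)
        if len(candidates) < quota:
            raise ValueError(f"not enough items in area {area!r}")
        selected.extend(candidates[:quota])
    return selected

# Hypothetical pilot results: algebra items happen to discriminate best.
pool = [
    {"id": 1, "area": "algebra", "disc": 0.55},
    {"id": 2, "area": "algebra", "disc": 0.52},
    {"id": 3, "area": "algebra", "disc": 0.50},
    {"id": 4, "area": "geometry", "disc": 0.35},
    {"id": 5, "area": "geometry", "disc": 0.30},
    {"id": 6, "area": "statistics", "disc": 0.40},
]
quotas = {"algebra": 1, "geometry": 1, "statistics": 1}

# Pure maximization: take the three highest discrimination values overall.
top_three = sorted(pool, key=lambda it: it["disc"], reverse=True)[:3]
# Constrained selection: best item per area, as the blueprint requires.
blueprint_pick = select_items(pool, quotas)

print({it["area"] for it in top_three})
print({it["area"] for it in blueprint_pick})
```

With this toy pool, the unconstrained pick draws every item from a single content area, while the constrained pick covers all three at the cost of a lower average discrimination.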
Question 2 Multiple Choice
An item on a medical licensing exam has a p-value of 0.97, meaning 97% of test-takers answered it correctly. What does this tell you about its contribution to the test?
A. It is an excellent item: nearly all examinees answered it correctly, confirming mastery of this content area
B. It contributes almost nothing to reliability because it produces virtually no variance; nearly everyone passes it regardless of true ability
C. Its discrimination index will be high because most examinees got it right
D. It should be removed only if it also has low face validity
Answer: B
Reliability is driven by variance in item scores across test-takers. When 97% answer correctly, the item produces almost no variance, so it can do little to distinguish among test-takers. A discrimination index measures the correlation between item score and total score; with so little item variance, the item's covariance with the total score is necessarily small, and the index is typically low and unstable. The item adds very little to the measurement of individual differences. Note: very easy items can serve other purposes (reducing anxiety, serving as a warm-up), but they should not make up the majority of a test designed to reliably differentiate among candidates.
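The size of the effect can be checked with a line of arithmetic. Assuming the item is scored dichotomously (1 = correct, 0 = incorrect), which the question implies but does not state, its variance depends only on the p-value:

```latex
\[
\sigma_i^2 = p_i\,(1 - p_i), \qquad
\sigma_i^2\big|_{p_i = 0.97} = 0.97 \times 0.03 = 0.0291, \qquad
\sigma_i^2\big|_{p_i = 0.50} = 0.50 \times 0.50 = 0.25 .
\]
```

Under that assumption, the item at p = 0.97 carries about 12% of the variance that an item at p = 0.50 would carry.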
Question 3 True / False
An item that every test-taker answers correctly contributes nothing to the reliability of the test.
T. True
F. False
Answer: True
True. Reliability, the proportion of score variance attributable to true differences between people, depends on items that produce variance in scores. If every test-taker answers an item correctly (p-value = 1.0), that item has zero variance. A zero-variance item has zero covariance with every other score, including the total test score, so its item-total correlation is undefined and it is conventionally treated as having zero discrimination. Such items contribute no information about individual differences and do not improve reliability. They may still serve other purposes (e.g., gauging absolute mastery of a critical safety item), but they are statistically inert for reliability purposes.
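A small check makes the "statistically inert" claim concrete. The response matrix below is hypothetical (six test-takers, three 0/1 items, the first answered correctly by everyone), and the `covariance` helper is written out only for illustration.

```python
from statistics import mean, pvariance

def covariance(xs, ys):
    """Population covariance of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical 0/1 responses: rows are test-takers, columns are items.
responses = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
]

constant_item = [row[0] for row in responses]      # everyone answered correctly
rest_score = [sum(row[1:]) for row in responses]   # score on the remaining items
total_score = [sum(row) for row in responses]      # score including the constant item

item_var = pvariance(constant_item)
item_cov = covariance(constant_item, rest_score)
# Adding a constant shifts every total by the same amount, so the spread is unchanged.
same_spread = pvariance(total_score) == pvariance(rest_score)

print(item_var, item_cov, same_spread)
```

In this sketch the item's variance and its covariance with the rest score are both zero, and including it leaves the variance of total scores unchanged. It raises everyone's score by one point without separating anyone.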
Question 4 True / False
The goal of item selection is to maximize the average discrimination index across all selected items, regardless of other considerations.
T. True
F. False
Answer: False
False. Item selection is constrained optimization: the goal is to maximize reliability (to which item discrimination contributes) within the requirement that each content area specified in the test blueprint meets its item quota. A purely statistical approach that ignores the content blueprint would sacrifice content validity; the test might be internally consistent but measure only a narrow part of the intended construct. Item difficulty distribution also matters: an overly narrow difficulty range reduces the test's ability to differentiate across the ability spectrum. Discrimination is one important criterion, not the sole objective.
Question 5 Short Answer
Why must test developers write an item pool two to three times larger than the final test, rather than writing exactly the items they intend to use?
Model answer: Because item properties cannot be predicted before pilot testing. Some items will turn out too easy or too hard (p-values near 0 or 1), producing little variance and low discrimination. Others will be ambiguous, poorly worded, or biased against subgroups — problems that only emerge when real test-takers respond. A large pool provides enough candidates in each content area that after eliminating poorly performing items, sufficient high-quality items remain to fill the content blueprint without compromising difficulty distribution or reliability.
The item pool exists to absorb attrition. Pilot testing is the quality control step that reveals which items work as intended. If you start with exactly the items you plan to use and several fail psychometric criteria, you either have to use weak items (harming reliability) or leave content areas undercovered (harming validity). The pool provides redundancy: for each content specification, you want multiple viable candidates so that statistical filtering still leaves adequate coverage.
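The attrition argument can be turned into rough planning arithmetic. The sketch below assumes a hypothetical blueprint and an assumed survival rate (the fraction of piloted items expected to meet psychometric criteria); neither figure comes from the text, and actual attrition varies by content area and item writer.

```python
import math

def required_pool(quotas, survival_rate):
    """Items to write per content area so that expected survivors meet each quota."""
    if not 0 < survival_rate <= 1:
        raise ValueError("survival_rate must be in (0, 1]")
    # Round up: a fractional item cannot be written.
    return {area: math.ceil(quota / survival_rate) for area, quota in quotas.items()}

# Hypothetical blueprint for a 20-item test.
quotas = {"algebra": 8, "geometry": 6, "statistics": 6}
# Assumed for illustration: half of the piloted items survive screening.
pool_plan = required_pool(quotas, survival_rate=0.5)

print(pool_plan, sum(pool_plan.values()))
```

At an assumed 50% survival rate the pool must be twice the test length, and lower survival rates push the multiple toward three or more. Because this covers only the expected number of survivors, a developer would likely add a margin per content area.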