Questions: Generalizability Theory and Multi-Faceted Reliability
5 questions to test your understanding
Question 1 Multiple Choice
A G-study examining essay scoring reveals the following variance components: 45% from persons, 10% from raters (main effect), 30% from the person×rater interaction, and 15% from items. Based on these results, which change to the test design would most improve the generalizability coefficient?
A. Adding more essay prompts, since item variance must be reduced to improve reliability
B. Increasing the number of raters, since rater-related variance (rater main effect + person×rater interaction) represents the largest source of error
C. Collecting scores on multiple occasions, since occasion variance is usually the biggest facet in performance assessments
D. Reducing the number of raters to one highly trained expert, eliminating the person×rater interaction entirely
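The variance components tell you where error is coming from. Here, rater-related variance totals 40% (10% rater main effect + 30% person×rater interaction), making it the dominant error source. Adding more raters averages across their idiosyncratic scoring tendencies, reducing this noise. Strictly, the rater main effect counts as error only for absolute decisions; for the generalizability coefficient, which concerns relative standing, the 30% person×rater interaction is the term that shrinks as raters are added. A D-study would formalize this forecast by predicting the G-coefficient at different rater counts. Adding more essay prompts would help if item-related variance were the bottleneck, but the data show that raters are. Relying on a single 'expert' rater does not eliminate the person×rater interaction: with one rater it can no longer be estimated separately from person variance, and that rater's idiosyncrasies are not averaged out, so generalizability to other raters gets worse, not better.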
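The forecast for Question 1 can be sketched numerically. This is a minimal illustration, not a full D-study: it assumes the percentages can be used directly as variance components, that components the question does not report (such as person×item and residual error) are zero, and that decisions are relative, so only the person×rater interaction enters the error term. The function name `relative_g` is made up for this sketch.

```python
# Variance components from Question 1, as percentages of total variance.
# Assumption: unreported components (person x item, residual) are treated as zero.
VAR_PERSON = 45.0
VAR_PERSON_X_RATER = 30.0


def relative_g(n_raters: int) -> float:
    """Relative G-coefficient when each score is averaged over n_raters raters."""
    # Averaging over raters divides the person x rater error by the rater count.
    relative_error = VAR_PERSON_X_RATER / n_raters
    return VAR_PERSON / (VAR_PERSON + relative_error)


forecast = {n: relative_g(n) for n in (1, 2, 3, 4)}
for n, g in forecast.items():
    print(f"{n} rater(s): G = {g:.3f}")
```

Under these assumptions the coefficient climbs from 0.60 with one rater to about 0.86 with four, which is why option B is the productive change.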
Question 2 Multiple Choice
How does a D-study (decision study) differ from a G-study (generalizability study)?
A. A D-study collects new data under the proposed design; a G-study applies those results to actual test decisions
B. A D-study uses the variance components estimated in the G-study to forecast how changing the number of conditions (raters, items, occasions) would affect the generalizability coefficient, without collecting new data
C. A D-study estimates variance due to person differences; a G-study estimates variance due to facets like raters and items
D. A D-study replaces CTT reliability calculations; a G-study supplements them
The G-study is the data-collection phase: participants are measured across multiple conditions of each facet, and the resulting data are analyzed to estimate how much variance each source (persons, raters, items, their interactions) contributes. The D-study then uses those variance component estimates to ask 'what-if' questions without collecting new data: if we used 4 raters instead of 2, how much would G improve? If we added 3 more items? This allows test designers to optimize the measurement design before committing to it, identifying the most cost-effective route to a target reliability level.
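The 'what-if' step can be sketched for a fully crossed person × rater × item design. The error formulas below are the standard ones for that design; the variance component values, the class name `VarianceComponents`, and the function name `d_study` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VarianceComponents:
    """G-study estimates for a crossed person x rater x item design."""
    p: float      # persons (universe-score variance)
    r: float      # raters
    i: float      # items
    pr: float     # person x rater
    pi: float     # person x item
    ri: float     # rater x item
    pri_e: float  # three-way interaction confounded with residual error


def d_study(vc: VarianceComponents, n_raters: int, n_items: int) -> tuple[float, float]:
    """Return (generalizability coefficient, dependability coefficient)."""
    # Relative error: only interactions involving persons, each divided by
    # the number of conditions it is averaged over.
    rel_error = vc.pr / n_raters + vc.pi / n_items + vc.pri_e / (n_raters * n_items)
    # Absolute error adds the facet main effects and their interaction.
    abs_error = rel_error + vc.r / n_raters + vc.i / n_items + vc.ri / (n_raters * n_items)
    return vc.p / (vc.p + rel_error), vc.p / (vc.p + abs_error)


# Hypothetical G-study estimates (not taken from any real data set).
vc = VarianceComponents(p=0.40, r=0.05, i=0.05, pr=0.20, pi=0.10, ri=0.0, pri_e=0.20)

# Compare candidate designs without collecting any new data.
for n_raters, n_items in [(2, 2), (4, 2), (2, 5)]:
    g, phi = d_study(vc, n_raters, n_items)
    print(f"raters={n_raters}, items={n_items}: G={g:.3f}, Phi={phi:.3f}")
```

With these made-up components, doubling the raters raises G more than adding three items does, which is the kind of cost comparison a D-study is meant to support.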
Question 3 True / False
A G-coefficient computed for a two-rater, two-item performance test addresses a more precisely defined reliability question than Cronbach's alpha computed on the same data.
T. True
F. False
Answer: True
Cronbach's alpha treats all non-person variance as undifferentiated error. A G-coefficient is specified for a particular universe of generalization — it answers 'how well do scores generalize across the specific facets included in this design?' If the design has two raters and two items, the G-coefficient tells you how reliably you can generalize to another pair of raters using another pair of items. A different G-study with different facets would yield a different G-coefficient. This specificity is both a strength (more actionable) and a limitation (not directly comparable across different designs).
Question 4 True / False
Generalizability theory renders classical test theory obsolete because it can answer most of the questions CTT can, plus provide facet-specific variance information.
T. True
F. False
Answer: False
G-theory and CTT are complementary, not competitive. CTT is simpler, requires less data, and is sufficient when the measurement involves a single dominant source of error (typically items). G-theory is indispensable when multiple facets are present — raters, occasions, testing sites — because only G-theory can identify which facet is the bottleneck and what redesigning the test around that bottleneck would yield. Choosing G-theory when CTT suffices adds unnecessary complexity; choosing CTT when multiple facets are present obscures the structure of error.
Question 5 Short Answer
Why can't you improve a test's reliability simply by examining its overall Cronbach's alpha, and what additional information does G-theory provide?
Model answer: Cronbach's alpha lumps all sources of error into one undifferentiated 'error variance' term, so you know reliability is low but not why. G-theory decomposes error into named facets (raters, items, occasions) and their interactions, revealing which specific source is the bottleneck. This allows targeted interventions — add more raters if rater variance dominates, add more items if item variance dominates — rather than guessing.
Imagine alpha = 0.68. Is the problem inconsistent items? Inconsistent raters? Performance that varies across occasions? Alpha cannot tell you. A G-study might reveal that the person×rater interaction accounts for 35% of variance and items only 4%. Adding items would barely move reliability; adding raters would substantially improve it. The D-study then estimates how many raters are needed to reach a target G of 0.85. Without G-theory, you can only observe that reliability is low; with it, you can diagnose why and prescribe a specific remedy.
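The search for a rater count that reaches the target can be sketched as a short loop. The example gives the person×rater share (35%) but not the person variance, so the value of 55% used here is a hypothetical assumption, as is the simplification that person×rater is the only source of relative error. The function name `min_raters_for_target` is also made up for this sketch.

```python
from typing import Optional


def min_raters_for_target(
    var_person: float,
    var_person_x_rater: float,
    target_g: float,
    max_raters: int = 50,
) -> Optional[tuple[int, float]]:
    """Smallest rater count whose forecast relative G meets target_g, or None."""
    for n_raters in range(1, max_raters + 1):
        # Averaging over raters divides the interaction error by the rater count.
        g = var_person / (var_person + var_person_x_rater / n_raters)
        if g >= target_g:
            return n_raters, g
    return None  # target not reachable within max_raters


# var_person=55.0 is hypothetical; only the 35% interaction share comes from the text.
result = min_raters_for_target(var_person=55.0, var_person_x_rater=35.0, target_g=0.85)
if result is not None:
    n_raters, g = result
    print(f"{n_raters} raters give a forecast G of {g:.3f}")
```

Under these assumed values, three raters fall just short of 0.85 and four raters clear it. A different person variance would change the answer, so the real calculation needs the full set of G-study estimates.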