Questions: Generalizability Studies: Design and Analysis
5 questions to test your understanding
Question 1 Multiple Choice
A G-study of a clinical skills exam reveals that rater variance accounts for 38% of total score variance, item variance accounts for 6%, and person variance accounts for 42%. You have a limited budget to improve reliability. What does a D-study direct you to do?
A. Add more items, because more items always reduce the largest source of error
B. Add more raters, because rater variance is the dominant error source and adding raters averages it out
C. Add more occasions, because occasion effects are always the largest source of error in performance assessments
D. Reduce the number of items to shorten the test and reduce candidate fatigue
Answer: B
A D-study uses the variance component estimates from the G-study to project how changing the number of levels of each facet affects the generalizability coefficient. When rater variance dominates the error, adding raters is the most efficient fix, because averaging over more raters cancels out idiosyncratic leniency or stringency. Adding items would help only if item variance were large, and here it is just 6%. The point of the two-step G/D workflow is to replace guesswork with resource allocation based on empirical variance component estimates.
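As a rough illustration of the projection a D-study performs, the sketch below treats the percentages from the question as variance components in a persons-by-raters-by-items design and computes the absolute (dependability) coefficient for a few designs. Several things here are assumptions rather than part of the question: the unreported 14% of variance is lumped into a single residual term, rater variance is counted as error (which holds for absolute decisions), and the function name `dependability` is hypothetical.

```python
# Variance components taken from the question, as proportions of total variance.
VAR_PERSON = 0.42
VAR_RATER = 0.38
VAR_ITEM = 0.06
# Assumption: the unreported 14% is treated as one residual (interactions + error).
VAR_RESIDUAL = 0.14


def dependability(n_raters: int, n_items: int) -> float:
    """Projected absolute coefficient when scores are averaged over raters and items."""
    error = (
        VAR_RATER / n_raters
        + VAR_ITEM / n_items
        + VAR_RESIDUAL / (n_raters * n_items)
    )
    return VAR_PERSON / (VAR_PERSON + error)


baseline = dependability(1, 1)
more_raters = dependability(2, 1)
more_items = dependability(1, 2)

print(f"1 rater, 1 item:  {baseline:.3f}")
print(f"2 raters, 1 item: {more_raters:.3f}")
print(f"1 rater, 2 items: {more_items:.3f}")
```

Under these assumptions, doubling the raters lifts the coefficient from about 0.42 to about 0.57, while doubling the items reaches only about 0.47.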
Question 2 Multiple Choice
A licensing board uses an oral exam to certify whether candidates meet a minimum competency standard of 75 points. Should they compute an absolute or relative generalizability coefficient, and why?
A. Relative, because they are ultimately comparing candidates against each other to award licenses
B. Absolute, because the decision is about meeting a fixed standard: a lenient rater who inflates everyone's scores changes who passes, even without changing rankings
C. Either coefficient, since both are mathematically equivalent when the decision threshold is fixed
D. Relative, because it is always more conservative and therefore safer for high-stakes decisions
Answer: B
The absolute/relative distinction maps directly onto the decision structure. For relative decisions (ranking, selecting the top N%), systematic facet effects such as overall rater leniency cancel out: if one rater gives everyone 10 points more, ranks are unchanged. For absolute decisions (meeting a fixed threshold), those systematic effects matter a great deal, because a lenient rater pushes borderline candidates over the cut score. The error variance for the absolute coefficient includes the facet main effects as well as every interaction; the error variance for the relative coefficient includes only the interactions that involve persons. The absolute coefficient can therefore never exceed the relative one, so reporting the relative coefficient for a licensing exam overstates how dependable the pass/fail decisions are.
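The difference between the two coefficients can be sketched for the simplest case, a fully crossed persons-by-raters design. The variance component values below are hypothetical and chosen only for illustration, and the function names are not taken from any particular library.

```python
def relative_coefficient(var_p, var_pr, n_raters):
    """Relative coefficient: only person-by-rater variance counts as error."""
    return var_p / (var_p + var_pr / n_raters)


def absolute_coefficient(var_p, var_r, var_pr, n_raters):
    """Absolute coefficient: the rater main effect also counts as error."""
    return var_p / (var_p + (var_r + var_pr) / n_raters)


# Hypothetical variance components (person, rater, person-by-rater/residual).
var_p, var_r, var_pr = 50.0, 20.0, 30.0
n_raters = 2

rel_coef = relative_coefficient(var_p, var_pr, n_raters)
abs_coef = absolute_coefficient(var_p, var_r, var_pr, n_raters)

print(f"relative: {rel_coef:.3f}")
print(f"absolute: {abs_coef:.3f}")
```

With these made-up numbers the relative coefficient is about 0.77 and the absolute coefficient about 0.67; the gap is entirely due to the rater main effect, which is the leniency or stringency that matters for a cut-score decision.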
Question 3 True / False
In G-theory, a lenient rater who gives every candidate a 10-point score inflation affects absolute decisions (pass/fail against a fixed standard) but not relative decisions (ranking candidates against each other).
T. True
F. False
Answer: True
This is the core intuition behind the absolute/relative distinction. For relative decisions, what matters is whether candidates' rank ordering is preserved, and uniform inflation shifts everyone equally, leaving ranks intact. For absolute decisions, a 10-point inflation systematically changes who clears the fixed cut score. G-theory formalizes this by including or excluding facet main effects in the error variance term, depending on which type of decision is being made. Classical reliability coefficients, which work with a single undifferentiated error term, cannot make this distinction.
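A small simulation makes the same point concrete. The candidate names and scores are invented for illustration, and the cut score of 75 is borrowed from Question 2 as an assumption.

```python
CUT_SCORE = 75   # assumed cut score, borrowed from Question 2
INFLATION = 10   # uniform leniency applied to every candidate

# Hypothetical scores from a neutral rater.
scores = {"Ana": 82, "Ben": 71, "Cy": 66, "Dee": 78, "Eli": 60}
inflated = {name: s + INFLATION for name, s in scores.items()}


def ranking(table):
    """Candidate names ordered from highest to lowest score."""
    return sorted(table, key=table.get, reverse=True)


def passers(table):
    """Names of candidates at or above the cut score."""
    return {name for name, s in table.items() if s >= CUT_SCORE}


same_ranking = ranking(scores) == ranking(inflated)
newly_passing = passers(inflated) - passers(scores)

print("ranking unchanged:", same_ranking)
print("pass only because of leniency:", sorted(newly_passing))
```

The rank order is identical before and after inflation, yet two of the five hypothetical candidates pass only because of the lenient rater.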
Question 4 True / False
Internal consistency coefficients like Cronbach's alpha are sufficient for evaluating the reliability of performance assessments involving multiple raters, tasks, and occasions, making G-study analyses unnecessary.
T. True
F. False
Answer: False
Internal consistency coefficients capture only item-level variance within a single administration. They cannot separate rater disagreement, occasion fluctuation, or task-specific variance as distinct error sources. For a performance assessment with three raters and five tasks, Cronbach's alpha can tell you whether items co-vary, but it cannot tell you whether your reliability problem is rater disagreement (fix: add raters) or task inconsistency (fix: add tasks). G-theory provides the diagnostic detail that a single classical reliability coefficient lacks.
Question 5 Short Answer
What is the practical difference between a G-study and a D-study, and why do you need both?
Model answer: A G-study is a data collection effort designed to estimate the variance component for each facet of the measurement situation. It answers the question "how much of the score variance comes from persons, raters, items, occasions, and their interactions?" A D-study uses those variance component estimates to project how the generalizability coefficient would change under different test designs. It answers the question "if I use two raters instead of three, or eight items instead of five, what reliability would I achieve?" You need the G-study to produce the empirical estimates on which the D-study projections rest. Without the G-study, test design is guesswork; without the D-study, the variance components are just descriptive statistics with no actionable implications.
The two-step workflow transforms G-theory from an interesting measurement framework into a practical design tool. The G-study produces the raw inputs (variance components); the D-study converts them into engineering specifications (how many raters/items/occasions do I need to reach a G-coefficient of 0.85?). Neither step alone answers the test designer's practical question.
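The two steps can be sketched end to end for a small, fully crossed persons-by-raters design with one score per cell. Everything in this example is an assumption made for illustration: the ratings are invented, the variance components are estimated with the standard expected-mean-squares approach for a two-way random-effects layout, and the helper names `projected_coefficient` and `raters_needed` are hypothetical.

```python
from statistics import mean

# Hypothetical ratings: rows are persons, columns are raters (fully crossed).
ratings = [
    [7, 8, 6],
    [5, 6, 4],
    [9, 9, 8],
    [4, 6, 3],
    [6, 8, 6],
]
n_p, n_r = len(ratings), len(ratings[0])

grand = mean(x for row in ratings for x in row)
person_means = [mean(row) for row in ratings]
rater_means = [mean(row[j] for row in ratings) for j in range(n_r)]

# G-study: mean squares for a two-way crossed design, one score per cell.
ms_p = n_r * sum((m - grand) ** 2 for m in person_means) / (n_p - 1)
ms_r = n_p * sum((m - grand) ** 2 for m in rater_means) / (n_r - 1)
ss_res = sum(
    (ratings[i][j] - person_means[i] - rater_means[j] + grand) ** 2
    for i in range(n_p)
    for j in range(n_r)
)
ms_res = ss_res / ((n_p - 1) * (n_r - 1))

# Variance components from expected mean squares (negative estimates set to 0).
var_pr = ms_res
var_p = max((ms_p - ms_res) / n_r, 0.0)
var_r = max((ms_r - ms_res) / n_p, 0.0)


def projected_coefficient(n_raters, absolute):
    """D-study: projected coefficient when averaging over n_raters raters."""
    error = var_pr / n_raters
    if absolute:
        error += var_r / n_raters
    return var_p / (var_p + error)


def raters_needed(target, absolute, max_raters=50):
    """Smallest number of raters reaching the target, or None if not reached."""
    for n in range(1, max_raters + 1):
        if projected_coefficient(n, absolute) >= target:
            return n
    return None


print("raters for relative 0.85:", raters_needed(0.85, absolute=False))
print("raters for absolute 0.85:", raters_needed(0.85, absolute=True))
```

The first half is the G-study (estimating components from data) and the second half is the D-study (searching over designs). Because the hypothetical raters differ in overall severity, the absolute target requires more raters than the relative one.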