Questions: Inter-Rater Reliability and Observer Agreement
5 questions to test your understanding
Question 1 Multiple Choice
Two clinical raters independently assess 100 patients for depression in a clinic where 95% of patients are not depressed. Both raters always code 'not depressed.' What are their percent agreement and Cohen's kappa?
A. Percent agreement = 95%, kappa ≈ 0
B. Percent agreement = 95%, kappa ≈ 0.95
C. Percent agreement = 100%, kappa = 1.0
D. Percent agreement = 100%, kappa ≈ 0
Answer: D
Both raters agree on every case, so percent agreement = 100%. That agreement carries no information beyond chance, however. Kappa computes the expected chance agreement P_e from each rater's own marginal proportions, and because both raters use 'not depressed' 100% of the time, P_e = 1. The formula κ = (P_o − P_e) / (1 − P_e) then reduces to 0/0, which is strictly undefined and is usually read as 0: the ratings provide essentially no evidence of true rater concordance. The 95% figure is the clinic's base rate, not the agreement rate. This is the base rate problem: high percent agreement can be meaningless when one category dominates.
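The Question 1 scenario can be checked with a minimal Python sketch. The function names and the choice to return None when P_e = 1 are illustrative assumptions, not a standard API; the quiz's "kappa ≈ 0" reflects the common reading of that undefined case as no agreement beyond chance.

```python
import math
from collections import Counter

def percent_agreement(r1, r2):
    """Share of cases on which the two raters give the same label."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohen_kappa(r1, r2):
    """Cohen's kappa for two equal-length label lists.

    Returns None when P_e == 1, where the formula is 0/0 (undefined).
    That return convention is an assumption made for this sketch.
    """
    n = len(r1)
    p_o = percent_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    # Expected chance agreement from each rater's own marginal proportions.
    p_e = sum((c1[k] / n) * (c2[k] / n) for k in set(c1) | set(c2))
    if math.isclose(p_e, 1.0):
        return None
    return (p_o - p_e) / (1 - p_e)

# Both raters code every one of the 100 patients as 'not depressed'.
rater_a = ["not depressed"] * 100
rater_b = ["not depressed"] * 100

print(percent_agreement(rater_a, rater_b))  # 1.0
print(cohen_kappa(rater_a, rater_b))        # None (0/0: no information beyond chance)
```

Returning None rather than 0 keeps the undefined case visible to the caller, who can then decide how to report it.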
Question 2 Multiple Choice
A researcher uses percent agreement to report inter-rater reliability for a coding scheme with three behavioral categories used roughly equally (≈33% each). Compared to Cohen's kappa, what is most likely true?
A. Percent agreement will be lower than kappa, because it ignores systematic rater bias
B. Percent agreement will be higher than kappa, because kappa subtracts the expected chance agreement
C. Percent agreement and kappa will be equal, because equal base rates eliminate chance agreement
D. Percent agreement will be higher than kappa, because kappa penalizes raters for using more than two categories
Answer: B
Kappa subtracts expected chance agreement from observed agreement: κ = (P_o − P_e) / (1 − P_e). When categories are used roughly equally, P_e (the expected agreement by chance) is about 1/3 for a three-category scheme, so 70% agreement would yield a kappa of about (0.70 − 0.33) / (1 − 0.33) ≈ 0.55, substantially lower than the raw 70%. Percent agreement never adjusts for chance, so it is always ≥ kappa, and strictly greater whenever agreement is imperfect and P_e > 0. Option D is wrong because kappa applies no penalty for the number of categories; the gap comes entirely from P_e.
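The arithmetic in this explanation can be reproduced with a short Python sketch. The helper names are hypothetical, and the marginals are assumed to be exactly 1/3 each for both raters.

```python
def chance_agreement(marginals_a, marginals_b):
    """P_e: sum over categories of the product of the two raters' marginals."""
    return sum(a * b for a, b in zip(marginals_a, marginals_b))

def kappa(p_o, p_e):
    """Cohen's kappa from observed (p_o) and expected (p_e) agreement."""
    return (p_o - p_e) / (1 - p_e)

equal_use = [1 / 3, 1 / 3, 1 / 3]   # three categories used equally often
p_e = chance_agreement(equal_use, equal_use)
k = kappa(0.70, p_e)

print(round(p_e, 3))  # 0.333
print(round(k, 2))    # 0.55
```

The gap between the two statistics is P_o − κ = P_e (1 − P_o) / (1 − P_e), which here is about 0.15 and can never be negative.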
Question 3 True / False
Cohen's kappa can be 0 even when two raters show high percent agreement, if that agreement is entirely explained by the expected base rate.
T. True
F. False
Answer: True
This is the central insight of kappa: it measures agreement *above and beyond* what would be expected by chance. When both raters systematically use the same dominant category (because it is very prevalent), their observed agreement P_o approaches P_e, making the numerator (P_o − P_e) approach 0. Kappa thus correctly reveals that the raters are not adding independent information — they are just reflecting the base rate. This is why percent agreement alone is an inadequate reliability metric.
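To see P_o approach P_e in practice, here is a small simulation sketch in Python. It assumes two hypothetical raters who ignore the patient entirely and each say 'absent' with probability 0.95; the function names, seed, and sample size are illustrative choices.

```python
import random

def simulate_independent_raters(n_cases, p_absent, seed=0):
    """Two raters who label at random, independently of each other and the case."""
    rng = random.Random(seed)
    a = ["absent" if rng.random() < p_absent else "present" for _ in range(n_cases)]
    b = ["absent" if rng.random() < p_absent else "present" for _ in range(n_cases)]
    return a, b

def binary_kappa(a, b):
    """Return (P_o, P_e, kappa) for two lists of 'absent'/'present' labels."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    pa = a.count("absent") / n   # rater A's marginal for 'absent'
    pb = b.count("absent") / n   # rater B's marginal for 'absent'
    p_e = pa * pb + (1 - pa) * (1 - pb)
    return p_o, p_e, (p_o - p_e) / (1 - p_e)

ratings_a, ratings_b = simulate_independent_raters(100_000, p_absent=0.95)
p_o, p_e, k = binary_kappa(ratings_a, ratings_b)
print(f"P_o={p_o:.3f}  P_e={p_e:.3f}  kappa={k:.3f}")
```

With these settings, observed agreement should land close to 0.95² + 0.05² ≈ 0.905 and kappa close to 0, even though the raters agree on roughly nine cases in ten.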
Question 4 True / False
A kappa of .80 is widely accepted as indicating good inter-rater reliability and can be applied as a universal threshold across most measurement contexts.
T. True
F. False
Answer: False
Kappa thresholds are context-dependent. In high-stakes clinical or legal settings (e.g., psychiatric diagnosis, neuroimaging interpretation), a kappa of .80 might be inadequate. In exploratory research with complex behavioral coding, a kappa of .60 might be acceptable. Standards also vary by number of categories, prevalence of categories, and the consequences of rater disagreement. The common misconception is treating any single threshold as universal — a sign that the researcher hasn't thought through the specific demands of their measurement context.
Question 5 Short Answer
Why does the prevalence of the categories being rated affect the interpretation of Cohen's kappa, and what problem does this create for researchers using binary diagnostic categories with rare conditions?
Model answer: Kappa corrects for expected chance agreement, which depends on the marginal distributions, that is, how often each rater uses each category. When one category is very rare (e.g., 5% of cases have the target condition), two raters who each say 'absent' about 95% of the time would agree on roughly 90% of cases by chance alone (0.95 × 0.95 + 0.05 × 0.05 ≈ 0.905). Observed agreement has little room to exceed that baseline, so kappa can be low or near 0 despite high percent agreement, even if both raters are doing their jobs well. In addition, when the condition is rare, a handful of discordant cases can swing kappa dramatically. This creates the 'kappa paradox': reliability appears low for rare conditions not because raters are performing poorly, but because the chance agreement baseline is so high.
This is a known limitation that has generated substantial debate in psychometrics. For rare conditions, alternatives like prevalence-adjusted bias-adjusted kappa (PABAK) or the intraclass correlation coefficient (for continuous ratings) may be more informative. The key lesson is that no single reliability metric is appropriate for all measurement contexts — understanding what a metric does and does not capture is as important as computing it.
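A short Python sketch can illustrate both the sensitivity of kappa under low prevalence and the PABAK alternative. It assumes a 2×2 table of counts with made-up numbers (5% prevalence, 100 cases) and uses the binary form PABAK = 2·P_o − 1; the function name and data are hypothetical.

```python
def agreement_stats(table):
    """Return (P_o, kappa, PABAK) for a 2x2 table of counts.

    table[i][j] = number of cases rater A put in category i and rater B in j.
    """
    n = sum(sum(row) for row in table)
    p_o = (table[0][0] + table[1][1]) / n
    row = [sum(table[i]) / n for i in range(2)]                  # rater A marginals
    col = [(table[0][j] + table[1][j]) / n for j in range(2)]    # rater B marginals
    p_e = sum(r * c for r, c in zip(row, col))
    kappa = (p_o - p_e) / (1 - p_e)
    pabak = 2 * p_o - 1   # binary prevalence-adjusted bias-adjusted kappa
    return p_o, kappa, pabak

# Illustrative data: categories are (absent, present); each rater flags 5 of 100.
two_disagreements = [[94, 1], [1, 4]]
four_disagreements = [[93, 2], [2, 3]]

for table in (two_disagreements, four_disagreements):
    p_o, kappa, pabak = agreement_stats(table)
    print(f"P_o={p_o:.2f}  kappa={kappa:.2f}  PABAK={pabak:.2f}")
# P_o=0.98  kappa=0.79  PABAK=0.96
# P_o=0.96  kappa=0.58  PABAK=0.92
```

In this made-up example, two additional discordant cases out of 100 move kappa from about 0.79 to about 0.58, while percent agreement and PABAK barely change. PABAK is not a free fix, though: by discarding the marginals it also discards information about prevalence and rater bias.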