Questions: Inter-Rater Reliability and Observer Agreement in Measurement
5 questions to test your understanding
Question 1 Multiple Choice
Two raters independently code 100 therapy transcripts as showing 'empathy' or 'not empathy.' They agree on 92 transcripts (92%). However, because each rater codes about 90% of transcripts as 'not empathy,' chance agreement alone would produce about 82% agreement. Is 92% a strong result?
A. Yes: 92% agreement is exceptionally high and demonstrates reliable measurement.
B. No: observed agreement is only about 10 percentage points above the chance baseline; Cohen's kappa would be modest, not strong.
C. No: only 100% agreement is acceptable in clinical research, since anything less introduces bias.
D. Yes: when raters are trained independently, any agreement above 80% is considered strong regardless of chance rates.
Percent agreement is inflated by the base rate of the most common category. When each rater codes 'not empathy' about 90% of the time, the two would agree by chance on approximately 82% of cases (0.9² + 0.1² = 0.82). The 92% observed agreement is only about 10 points above chance. Cohen's kappa, κ = (0.92 − 0.82)/(1 − 0.82) ≈ 0.56, indicates moderate rather than strong agreement. Percent agreement alone would mislead you into thinking the measurement is more reliable than it is.
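The arithmetic above can be checked with a short Python sketch. The 2×2 table used here is an assumption: it is one set of counts consistent with the question (92 agreements out of 100, each rater coding 90 transcripts as 'not empathy'), not data given in the quiz, and the function name cohen_kappa is illustrative.

```python
def cohen_kappa(table):
    """Return (p_observed, p_chance, kappa) for a square contingency table.

    Rows are rater 1's categories, columns are rater 2's categories.
    """
    n = sum(sum(row) for row in table)
    k = len(table)
    # Observed agreement: proportion of cases on the diagonal.
    p_observed = sum(table[i][i] for i in range(k)) / n
    # Marginal proportions for each rater.
    row_marginals = [sum(row) / n for row in table]
    col_marginals = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    # Chance agreement: probability both raters pick the same category
    # if they coded independently at their own base rates.
    p_chance = sum(r * c for r, c in zip(row_marginals, col_marginals))
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return p_observed, p_chance, kappa


# Assumed counts (category order: 'not empathy', 'empathy').
table = [
    [86, 4],  # rater 1 'not empathy': rater 2 agrees on 86, disagrees on 4
    [4, 6],   # rater 1 'empathy':     rater 2 disagrees on 4, agrees on 6
]

p_observed, p_chance, kappa = cohen_kappa(table)
print(f"{p_observed:.2f} {p_chance:.2f} {kappa:.2f}")  # prints 0.92 0.82 0.56
```

Other tables with the same marginals and the same number of agreements would give the same kappa, since the statistic depends only on those quantities.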
Question 2 Multiple Choice
A researcher reports high inter-rater reliability for a behavioral coding scheme that categorizes therapist behaviors. A critic argues that the coding scheme may not actually capture 'therapeutic alliance' as the researcher intends. What is the critic addressing?
A. Inter-rater reliability: the raters may be consistently making the same coding errors.
B. Validity: the measure may be reliable (consistent across raters) without actually capturing the intended psychological construct.
C. Internal consistency: the individual items in the coding scheme may not correlate with each other.
D. Test-retest reliability: coders may rate the same transcript differently on different occasions.
High inter-rater reliability demonstrates only that raters agree consistently; it says nothing about whether the coding scheme measures what it claims to measure. Two raters can consistently code the wrong thing. Validity asks whether the measurement captures the intended construct; reliability asks whether it produces consistent results. These are distinct questions: a measure can be highly reliable but invalid (consistently measuring the wrong thing), and a measure aimed at the right construct can still be unreliable (measuring it inconsistently), which limits how valid its scores can be in practice.
Question 3 True / False
Low inter-rater reliability is most commonly caused by inadequate operational definitions that leave room for legitimate interpretive differences between coders.
T. True
F. False
Answer: True
When raters disagree persistently, the problem is rarely that one rater is careless or poorly trained; it is usually that the coding rules leave room for more than one defensible interpretation. Sharpening operational definitions (replacing vague terms with specific behavioral anchors, providing examples of boundary cases, conducting calibration sessions) is the standard remedy. This process is also epistemically valuable: it forces researchers to specify exactly what they mean by their constructs, often revealing ambiguities that were hidden in plain sight.
Question 4 True / False
High inter-rater reliability is sufficient evidence that a measure is valid: if observers consistently agree, the measure must be capturing the real phenomenon.
T. True
F. False
Answer: False
Reliability and validity are distinct properties. High agreement means observers are applying the same criteria consistently, but those criteria might be consistently measuring something other than the intended construct. For example, coders might reliably agree on whether a behavior occurred (high reliability) while that behavior turns out not to predict the outcome researchers care about (low validity). Reliability is necessary but not sufficient for validity: you need consistency to measure anything at all, but consistency alone doesn't guarantee you're measuring the right thing.
Question 5 Short Answer
Why is percent agreement alone insufficient for evaluating inter-rater reliability, and what does Cohen's kappa add?
Think about your answer before reading the model answer below.
Model answer: Percent agreement ignores how much agreement would occur by chance, given the base rates of each category. When one category dominates (e.g., 90% of observations are 'absent'), two raters can agree on the vast majority of cases simply by independently defaulting to the dominant category — with no actual shared judgment. Cohen's kappa corrects for this: κ = (P_observed − P_chance) / (1 − P_chance). It measures the agreement *above and beyond* what chance alone would produce, giving a more accurate picture of whether the raters are genuinely applying the same criteria.
The practical implication is that high percent agreement in highly skewed distributions can mask very low actual reliability. A κ of 0 means the raters agree no more often than chance would predict; a κ of 1 means perfect agreement; negative values mean agreement below chance. Values above about .70 are often treated as acceptable, though cutoffs vary by field and by the benchmark used. Kappa discounts agreement that is inflated by base rates, making it more informative than percent agreement when categories are not equally frequent, which is most of the time in behavioral research.
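To make the masking effect concrete, here is a minimal Python sketch that computes percent agreement and kappa from two lists of codes. The data are hypothetical and were constructed only to illustrate a skewed distribution; the function name agreement_and_kappa is likewise illustrative.

```python
from collections import Counter


def agreement_and_kappa(rater1, rater2):
    """Return (percent agreement, Cohen's kappa) for two equal-length label lists."""
    n = len(rater1)
    # Observed agreement: share of items given the same label by both raters.
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement from each rater's own category frequencies.
    freq1, freq2 = Counter(rater1), Counter(rater2)
    categories = freq1.keys() | freq2.keys()
    p_chance = sum(freq1[c] * freq2[c] for c in categories) / (n * n)
    return p_observed, (p_observed - p_chance) / (1 - p_chance)


# Hypothetical codes for 100 observations: each rater marks 'present' five
# times, but never on the same observation as the other rater.
rater1 = ["absent"] * 90 + ["absent"] * 5 + ["present"] * 5
rater2 = ["absent"] * 90 + ["present"] * 5 + ["absent"] * 5

p_observed, kappa = agreement_and_kappa(rater1, rater2)
print(f"{p_observed:.2f} {kappa:.2f}")  # prints 0.90 -0.05
```

In this constructed case the raters agree on 90% of observations, yet kappa is slightly below zero: every agreement comes from the dominant 'absent' category, and the raters never agree on when the behavior is present.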