Questions: Inter-Rater Reliability and Observer Agreement in Measurement

5 questions to test your understanding

Question 1 Multiple Choice

Two raters independently code 100 therapy transcripts as showing 'empathy' or 'not empathy.' They agree on 92 transcripts (92%). However, because each rater codes about 90% of transcripts as 'not empathy,' chance agreement alone would be about 82% (0.9 × 0.9 + 0.1 × 0.1). Is 92% a strong result?

A. Yes: 92% agreement is exceptionally high and demonstrates reliable measurement.
B. No: observed agreement is only about 10 percentage points above the chance baseline, so Cohen's kappa would be modest, not strong.
C. No: only 100% agreement is acceptable in clinical research, since anything less introduces bias.
D. Yes: when raters are trained independently, any agreement above 80% is considered strong regardless of chance rates.
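
The arithmetic behind this question can be sketched in a few lines of Python. This is an illustration only: it assumes each rater codes 90% of transcripts 'not empathy' and 10% 'empathy' (which is what yields the 82% chance figure), and the function name `cohens_kappa` is a hypothetical helper, not part of any library.

```python
def cohens_kappa(p_observed: float, p_chance: float) -> float:
    """Agreement beyond chance, as a fraction of the maximum possible beyond-chance agreement."""
    return (p_observed - p_chance) / (1.0 - p_chance)

# Assumption: both raters use 'not empathy' for 90% of transcripts.
p_not = 0.90

# Chance agreement: both say 'not empathy' by chance, or both say 'empathy' by chance.
p_chance = p_not * p_not + (1 - p_not) * (1 - p_not)

# Observed agreement from the question: 92 of 100 transcripts.
p_observed = 92 / 100

kappa = cohens_kappa(p_observed, p_chance)
print(f"chance agreement = {p_chance:.2f}, kappa = {kappa:.2f}")
```

Under these assumptions the 10-point gap between observed and chance agreement is divided by the 18 points of agreement that were available beyond chance, which is why kappa lands well below the raw 92%.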
Question 2 Multiple Choice

A researcher reports high inter-rater reliability for a behavioral coding scheme that categorizes therapist behaviors. A critic argues that the coding scheme may not actually capture 'therapeutic alliance' as the researcher intends. What is the critic addressing?

A. Inter-rater reliability: the raters may be consistently making the same coding errors.
B. Validity: the measure may be reliable (consistent across raters) without actually capturing the intended psychological construct.
C. Internal consistency: the individual items in the coding scheme may not correlate with each other.
D. Test-retest reliability: coders may rate the same transcript differently on different occasions.
Question 3 True / False

Low inter-rater reliability is most commonly caused by inadequate operational definitions that leave room for legitimate interpretive differences between coders.

T. True
F. False
Question 4 True / False

High inter-rater reliability is sufficient evidence that a measure is valid: if observers consistently agree, the measure must be capturing the real phenomenon.

T. True
F. False
Question 5 Short Answer

Why is percent agreement alone insufficient for evaluating inter-rater reliability, and what does Cohen's kappa add?
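
One way to explore this question is to compute both statistics from raw codes. The sketch below is a minimal illustration using made-up ratings (the `rater_1` and `rater_2` lists are hypothetical data, and both function names are assumptions for this example). It estimates chance agreement from each rater's own category proportions, which is the standard two-rater form of Cohen's kappa.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which the two raters gave the same code."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa_from_labels(a, b):
    """Cohen's kappa for two raters; undefined (division by zero) if chance agreement is 1."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: sum over categories of the product of the raters' marginal proportions.
    p_chance = sum((counts_a[c] / n) * (counts_b[c] / n) for c in set(a) | set(b))
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical data: a rare 'yes' category; the raters never agree on a 'yes'.
rater_1 = ["no"] * 18 + ["yes", "no"]
rater_2 = ["no"] * 18 + ["no", "yes"]

agreement = percent_agreement(rater_1, rater_2)
kappa = cohens_kappa_from_labels(rater_1, rater_2)
print(f"percent agreement = {agreement:.2f}, kappa = {kappa:.3f}")
```

With these invented codes, percent agreement is high even though the raters never coincide on the rare category, and kappa comes out slightly below zero.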
