Inter-rater reliability (or inter-observer agreement) measures the degree to which different observers or raters independently arrive at the same conclusions when evaluating the same phenomena, whether through behavioral coding, clinical judgment, or content analysis. High inter-rater reliability indicates that the measurement procedure produces consistent results across observers, providing evidence that measurements reflect the phenomenon rather than idiosyncratic observer biases. Common statistics include Cohen's kappa, intraclass correlations, and percent agreement. Low agreement suggests unclear operational definitions, inadequate training, or subjective measurement.
Have multiple coders independently code the same sample of data, calculate agreement statistics, and then examine the items they coded differently to identify sources of disagreement.
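The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed workflow: the function names `percent_agreement` and `disagreement_table` and the six example codes are hypothetical, invented only to show the idea of tallying which code pairs account for the disagreements.

```python
from collections import Counter


def percent_agreement(codes_a, codes_b):
    """Proportion of items that two coders assigned the same code."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("Both coders must code the same non-empty set of items.")
    return sum(a == b for a, b in zip(codes_a, codes_b)) / len(codes_a)


def disagreement_table(codes_a, codes_b):
    """Count each (coder A code, coder B code) pair where the coders differ."""
    return Counter((a, b) for a, b in zip(codes_a, codes_b) if a != b)


# Hypothetical codes for six observed behaviors (illustration only).
coder_a = ["aggressive", "nonaggressive", "aggressive",
           "nonaggressive", "aggressive", "nonaggressive"]
coder_b = ["aggressive", "nonaggressive", "nonaggressive",
           "nonaggressive", "nonaggressive", "nonaggressive"]

agreement = percent_agreement(coder_a, coder_b)
table = disagreement_table(coder_a, coder_b)

print(f"Percent agreement: {agreement:.0%}")
for (code_a, code_b), count in table.most_common():
    print(f"Coder A said {code_a!r}, coder B said {code_b!r}: {count} item(s)")
```

In this made-up sample every disagreement runs in the same direction (coder A codes "aggressive" where coder B codes "nonaggressive"), which would point to a threshold difference in how the two coders apply the definition rather than to random error.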
Two misconceptions are common. The first is that high inter-rater agreement means the measure is valid; in fact, agreement establishes only consistency, not that the measure captures the intended construct. The second is that perfect agreement is necessary and achievable; in fact, some disagreement is inevitable, and how much is acceptable depends on the context.
From your study of reliability, you know that a measure must produce consistent results to be useful — and you know that consistency can be assessed across time (test-retest), across items (internal consistency), and across forms (parallel forms). Inter-rater reliability adds a fourth dimension: consistency across observers. Whenever a measurement procedure requires a human judge to categorize, rate, or code something — whether counting aggressive behaviors on a playground, rating interview responses for quality, or coding therapy transcripts for therapist empathy — the measurement is only as reliable as the agreement between different observers. Without this check, you cannot distinguish signal (the real phenomenon) from noise (the idiosyncratic perceptions of one coder).
The most important conceptual step is understanding why percent agreement is insufficient on its own. Suppose two raters independently code whether each of 100 behaviors is "aggressive" or "nonaggressive," and they agree on 90 of them. An agreement rate of 90% sounds impressive, but what if the raters would have agreed on 85% of cases by chance alone, simply because most behaviors are nonaggressive? Observed agreement is then only 5 percentage points above chance. Cohen's kappa corrects for this by expressing the agreement achieved beyond chance as a fraction of the maximum possible agreement beyond chance: κ = (P_o − P_e) / (1 − P_e), where P_o is the observed proportion of agreement and P_e is the proportion expected by chance. Here κ = (.90 − .85) / (1 − .85) ≈ .33. A kappa of 0 means agreement no better than chance, a kappa of 1.0 means perfect agreement, and negative values mean agreement worse than chance. As a rule of thumb, values above .70 are generally considered acceptable and values above .80 strong, although conventions vary across fields.
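To make the chance correction concrete, here is a minimal Python sketch of Cohen's kappa for two raters. The function names and the 100-item data set are hypothetical, constructed for illustration; chance agreement P_e is estimated in the usual way, from each rater's marginal proportions for every category.

```python
from collections import Counter


def kappa_from_proportions(p_o, p_e):
    """Cohen's kappa from observed (p_o) and chance-expected (p_e) agreement."""
    return (p_o - p_e) / (1 - p_e)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical codes on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must code the same non-empty set of items.")
    n = len(rater_a)

    # Observed agreement: share of items given the same code by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each category, multiply the two raters'
    # marginal proportions, then sum over categories.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    p_e = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)

    if p_e == 1.0:
        raise ValueError("Kappa is undefined when chance agreement equals 1.")
    return kappa_from_proportions(p_o, p_e)


# The example from the text: 90% observed agreement, 85% expected by chance.
text_example = kappa_from_proportions(0.90, 0.85)
print(f"kappa for the text example: {text_example:.2f}")

# Hypothetical 100-item data set: the raters agree on 90 items
# (5 aggressive, 85 nonaggressive) and disagree on 10.
rater_a = ["agg"] * 5 + ["agg"] * 5 + ["non"] * 5 + ["non"] * 85
rater_b = ["agg"] * 5 + ["non"] * 5 + ["agg"] * 5 + ["non"] * 85
data_example = cohens_kappa(rater_a, rater_b)
print(f"kappa for the 100-item data set: {data_example:.2f}")
```

In the hypothetical data set each rater codes 10% of behaviors as aggressive, so chance agreement is .10 × .10 + .90 × .90 = .82, and kappa is (.90 − .82) / (1 − .82) ≈ .44 despite the 90% raw agreement.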
Intraclass correlations (ICC) are used when raters assign continuous numerical ratings (e.g., rating interview performance on a 1–10 scale) rather than categorical codes. ICC estimates the proportion of score variance attributable to real differences between the things being rated, versus differences between raters or random noise. The appropriate form of ICC depends on whether the same raters rate everyone (two-way ICC) or different raters rate different targets (one-way ICC), and whether you are interested in absolute agreement (raters assign the same scores) or only consistency (raters order and space the targets similarly, even if one rater is systematically more lenient than another).
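The difference between these forms is easier to see with numbers. The sketch below computes three single-rater ICCs from ANOVA mean squares, using the standard formulas often labelled ICC(1,1), ICC(2,1), and ICC(3,1) in the Shrout and Fleiss scheme. The function name `single_rater_iccs` and the ratings are hypothetical; the data are built so that the second rater scores every target exactly 2 points higher than the first. The one-way value is computed from the same table purely for comparison, even though that design would normally involve different raters per target.

```python
def single_rater_iccs(ratings):
    """Single-rater ICCs for a complete table: ratings[i][j] is the score
    given to target i by rater j."""
    n = len(ratings)
    k = len(ratings[0])
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("Need a complete table with at least 2 targets and 2 raters.")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]                       # per target
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]  # per rater

    # Sums of squares for targets, raters, and the residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_resid = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )

    msr = ss_rows / (n - 1)                     # between-target mean square
    msc = ss_cols / (k - 1)                     # between-rater mean square
    mse = ss_resid / ((n - 1) * (k - 1))        # residual mean square
    msw = (ss_cols + ss_resid) / (n * (k - 1))  # within-target mean square

    return {
        "one_way": (msr - msw) / (msr + (k - 1) * msw),
        "two_way_absolute_agreement":
            (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n),
        "two_way_consistency": (msr - mse) / (msr + (k - 1) * mse),
    }


# Hypothetical ratings: rater 2 is always exactly 2 points more lenient.
ratings = [[2, 4], [4, 6], [6, 8], [8, 10]]
iccs = single_rater_iccs(ratings)
for name, value in iccs.items():
    print(f"{name}: {value:.3f}")
```

With these ratings the consistency ICC is 1.0, because the two raters differ only by a constant offset, while the absolute-agreement ICC drops to about .77 because that offset counts as disagreement.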
Low inter-rater reliability is most often a symptom of inadequate operational definitions. If two coders consistently disagree, the usual explanation is not that one is careless; it is that the coding scheme leaves room for legitimate interpretive differences. The remedy is to sharpen the definition: replace vague terms with specific behavioral anchors, provide examples of boundary cases, conduct calibration sessions where coders discuss disagreements, and iterate until the coding rules leave minimal room for interpretation. This process of achieving rater agreement is itself epistemically valuable: it forces researchers to specify exactly what they mean by the constructs they are measuring, which often reveals conceptual ambiguities that were hiding in plain sight.