Questions: Reliability Estimation Methods and Method Selection
5 questions to test your understanding
Question 1 Multiple Choice
A researcher develops a mood scale with α = .91 and publishes it. A clinician wants to use the scale to track whether a patient's mood improves across six weekly therapy sessions. What critical reliability evidence is missing from the published report?
A. Nothing. An α of .91 is sufficient reliability evidence for any use case, including repeated clinical measurement
B. Test-retest reliability, because α measures item homogeneity at a single time point and says nothing about whether scores are stable across sessions when nothing has truly changed
C. Inter-rater reliability, because clinicians will score the items differently than the original researchers did
D. A larger sample, since α is only valid when computed on samples over 500
Answer: B
Cronbach's α tells you that the items correlate well with each other within a single administration, but it says nothing about temporal stability. For a clinical application that tracks change over weeks, you need evidence that stable patients (those with no real change) produce similar scores across administrations. High internal consistency is compatible with very low test-retest reliability if the construct is genuinely volatile or if situational factors influence responding. The clinician's use case demands test-retest evidence that α cannot provide.
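To make the single-time-point limitation concrete, here is a minimal sketch of the standard α formula, k/(k-1) × (1 - Σ item variances / variance of total scores). The function name `cronbach_alpha` and the response data are hypothetical and chosen only for illustration. Note that the only input is one respondents-by-items table from one administration, so nothing in the calculation can reflect stability across sessions.

```python
from statistics import variance

def cronbach_alpha(scores):
    """Cronbach's alpha for one administration.

    scores: list of respondents, each a list of k item scores.
    """
    k = len(scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical data: 5 respondents x 4 items, collected at ONE session.
responses = [
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [5, 5, 4, 5],
    [1, 2, 1, 2],
]
alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.2f}")
```

For these made-up responses α comes out at roughly .97, yet the data contain no second administration, so the result cannot speak to whether a stable patient would score the same next week.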
Question 2 Multiple Choice
Two clinicians independently code 50 structured psychiatric interviews to diagnose PTSD using binary yes/no criteria. Which reliability statistic is most appropriate?
A. Cronbach's alpha, to assess whether the diagnostic criteria are internally consistent
B. Cohen's kappa, which corrects for chance agreement between two raters on categorical judgments
C. Test-retest reliability, since the same interviews should produce stable diagnoses regardless of rater
D. Pearson correlation, since it captures how consistently the two raters rank patients
Answer: B
This is a classic inter-rater reliability scenario: two human raters making categorical judgments. Cohen's kappa is designed for exactly this case. It measures agreement between raters beyond what chance would produce, whereas raw percent agreement ignores that two raters guessing on a binary variable with a 50/50 base rate would still agree about half the time. Pearson correlation measures linear association rather than agreement and is not designed for nominal categories. Cronbach's α addresses item homogeneity, not rater agreement. Test-retest addresses temporal stability, not judge-to-judge consistency.
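The chance correction described above can be sketched in a few lines. This is an illustrative implementation of the usual two-rater formula, κ = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected from each rater's marginal rates. The function name `cohens_kappa` and the 50 diagnoses are hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning nominal categories."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed proportion of cases where the raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal category rates.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    p_e = sum(count_a[c] * count_b[c] for c in categories) / n ** 2
    if p_e == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)

# Hypothetical PTSD diagnoses for 50 interviews:
# 20 both "yes", 20 both "no", 10 disagreements split evenly.
rater_a = ["yes"] * 20 + ["no"] * 20 + ["yes"] * 5 + ["no"] * 5
rater_b = ["yes"] * 20 + ["no"] * 20 + ["no"] * 5 + ["yes"] * 5

percent_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
kappa = cohens_kappa(rater_a, rater_b)
print(f"percent agreement = {percent_agreement:.2f}, kappa = {kappa:.2f}")
```

In this made-up example the raters agree on 80% of cases, but because each rater says "yes" half the time, chance alone would yield 50% agreement, and kappa is .60.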
Question 3 True / False
A personality questionnaire can have high internal consistency (α = .90) but low test-retest reliability if the measured construct is genuinely unstable across time.
T. True
F. False
Answer: True
This is a crucial insight. High α means the items agree with each other about where a person stands today. It says nothing about whether the person will score similarly next week. If the construct actually fluctuates (e.g., daily mood, situational anxiety), test-retest reliability will be low even though the scale measures something real and measures it precisely. The two coefficients respond to different error sources: inconsistency among items versus instability over time. You cannot infer one from the other.
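A small simulation can illustrate how the two coefficients come apart. The sketch below assumes a deliberately extreme case: each person's true state is drawn independently in week 1 and week 2 (a fully volatile construct), and six items measure the current state with a little noise. All names, sample sizes, and noise levels are assumptions made for illustration, not values from any real scale.

```python
import random
from statistics import mean, variance

def cronbach_alpha(scores):
    """Cronbach's alpha from a respondents-by-items table."""
    k = len(scores[0])
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def administer(states, rng, n_items=6, noise_sd=0.5):
    """Each item score = the person's current state + item-level noise."""
    return [[s + rng.gauss(0, noise_sd) for _ in range(n_items)] for s in states]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_people = 500

# Assumption: the construct is fully volatile, so week-2 states are
# drawn independently of week-1 states.
state_week1 = [rng.gauss(0, 1) for _ in range(n_people)]
state_week2 = [rng.gauss(0, 1) for _ in range(n_people)]

week1 = administer(state_week1, rng)
week2 = administer(state_week2, rng)

alpha_week1 = cronbach_alpha(week1)
retest_r = pearson_r([sum(r) for r in week1], [sum(r) for r in week2])
print(f"alpha (week 1) = {alpha_week1:.2f}, test-retest r = {retest_r:.2f}")
```

Under these assumptions α should be high, because the items track the same momentary state closely, while the week-to-week correlation of total scores should sit near zero. Real constructs fall somewhere between this extreme and perfect stability.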
Question 4 True / False
A single well-chosen reliability coefficient is generally sufficient to establish the reliability of a psychological measure for research and clinical use.
T. True
F. False
Answer: False
Treating a single coefficient as sufficient is a common mistake in applied psychometrics. Different methods capture different error sources: internal consistency (item homogeneity), test-retest (temporal stability), and inter-rater (judge variability). A personality questionnaire with high α may still have untested temporal stability; a clinical interview with good test-retest reliability may have hidden inter-rater disagreement. Complete reliability evidence requires addressing every error source relevant to the measure's intended use, which almost always means reporting multiple estimates.
Question 5 Short Answer
Why is Cronbach's alpha insufficient as the only reliability evidence for a structured clinical interview that is scored by different clinicians?
Model answer: Alpha measures item homogeneity at a single time point, that is, whether the interview items correlate with each other. But a clinical interview has a critical additional error source: rater variability. Two clinicians applying the same criteria may still reach different diagnoses due to differences in training, interpretation, or judgment. Alpha is blind to this source of error. Inter-rater reliability (e.g., Cohen's kappa for categorical diagnoses or an intraclass correlation coefficient for continuous ratings) is needed to determine whether two clinicians would consistently agree when scoring the same patient.
The governing principle is: choose the reliability method that directly estimates the primary error source for your measurement context. For a clinical interview, rater variability is at least as important as item coherence, yet alpha tells you nothing about it. Reporting only alpha and calling the measure "reliable" misleads users into thinking a source of error has been ruled out when it has merely been ignored.