A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Reliability in Psychological Measurement

College Depth 64 in the knowledge graph

143 topics build on this

293 prerequisites beneath it


Core Idea

Reliability is the consistency of a measurement: the degree to which it produces the same result under the same conditions. Types include test-retest reliability (consistency over time), inter-rater reliability (consistency across observers), and internal consistency (coherence among items in a scale, commonly assessed with Cronbach's alpha). Reliability is a necessary but not sufficient condition for validity: a measure can be reliably wrong. Reliability puts a ceiling on how valid a measure can be, so an unreliable measure cannot be valid.

How It's Best Learned

Compute Cronbach's alpha for a simple 5-item scale dataset. Then compare a reliable and unreliable measure of the same construct and explain how each would affect a study's conclusions.
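One way to try the first part of this exercise is sketched below. It is a minimal illustration, not a reference implementation: the `responses` data are made up for the example (six hypothetical respondents answering five items on a 1 to 5 scale), and `cronbach_alpha` is a helper name chosen here. It uses the usual variance-based formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores).

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(rows[0])
    items = list(zip(*rows))                       # regroup scores by item
    item_var_sum = sum(variance(item) for item in items)
    total_var = variance([sum(row) for row in rows])  # variance of scale totals
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


# Hypothetical data: 6 respondents x 5 items, invented for illustration only.
responses = [
    [4, 5, 4, 4, 5],
    [2, 2, 3, 2, 2],
    [3, 3, 3, 4, 3],
    [5, 4, 5, 5, 4],
    [1, 2, 1, 2, 1],
    [3, 4, 3, 3, 4],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha: {alpha:.3f}")
```

Because respondents in this toy dataset answer all five items in a similar way, the total-score variance is much larger than the sum of the item variances, and alpha comes out high. Shuffling one item's column independently of the others is a quick way to see alpha drop.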

Common Misconceptions

Reliability does not mean accuracy — a bathroom scale consistently reading 5 kg too heavy is reliable but not valid.
Cronbach's alpha above .70 is often cited as 'acceptable,' but context matters — clinical measures may require .90+.

Explainer

Before you can trust the conclusions of any psychological study, you need to ask a foundational question: is the measurement actually measuring what it claims to measure, and is it doing so consistently? Reliability addresses the second part — consistency — and it is a prerequisite for the first.

Think of reliability in terms of measurement error: it is the share of the variance in observed scores that reflects true differences rather than random noise. Every observed score is the sum of a true score and some random error. If you repeat a measurement under identical conditions and get widely different results, the instrument has low reliability: most of the variance is error, not signal. Three types of reliability capture different sources of inconsistency. *Test-retest reliability* checks whether scores are stable over time (important for traits like personality, less meaningful for states like mood, which are expected to change). *Inter-rater reliability* checks whether different judges or observers score the same behavior consistently, which is critical in clinical diagnosis and behavioral observation research. *Internal consistency* checks whether items within a scale all tap the same underlying construct.
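The true-score-plus-error idea can be illustrated with a small simulation. This is a sketch under stated assumptions: scores are normally distributed, errors at the two testing occasions are independent, and the true score does not change between them. The function names and the specific standard deviations are chosen here for illustration.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def theoretical_reliability(true_sd, error_sd):
    """Share of observed variance that is true-score variance."""
    return true_sd**2 / (true_sd**2 + error_sd**2)


def simulate_test_retest(n_people, true_sd, error_sd, seed=0):
    """Correlate two administrations sharing a true score but not their errors."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(50, true_sd) for _ in range(n_people)]
    time1 = [t + rng.gauss(0, error_sd) for t in true_scores]
    time2 = [t + rng.gauss(0, error_sd) for t in true_scores]
    return pearson(time1, time2)


# Mostly signal (true SD 10, error SD 5) versus mostly noise (true SD 5, error SD 10).
for true_sd, error_sd in [(10, 5), (5, 10)]:
    expected = theoretical_reliability(true_sd, error_sd)
    observed = simulate_test_retest(5000, true_sd, error_sd, seed=1)
    print(f"true SD {true_sd}, error SD {error_sd}: "
          f"expected {expected:.2f}, simulated test-retest r {observed:.2f}")
```

Under these assumptions the simulated test-retest correlation should land close to the variance ratio, which is why a test-retest correlation is commonly read as an estimate of reliability.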

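Cronbach's alpha, the most widely used index of internal consistency, normally falls between 0 and 1, although it can be negative when items covary negatively (often a sign of a scoring problem such as an item that was not reverse-coded). It equals the mean of all possible split-half reliability coefficients for the scale, not the mean of the raw split-half correlations. An alpha of .80 is usually read as good internal consistency, but alpha rises with the number of items as well as with their intercorrelations, so it does not by itself show that the items measure a single coherent construct. The commonly cited cutoff of .70 is a rough rule for exploratory research; clinical tools that inform treatment decisions typically require .90 or higher, because low reliability means individual patients could score very differently on retesting through no real change in their condition.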
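For reference, the standard textbook formulas for alpha are sketched below. The symbols are notation chosen here, not taken from the text above: $k$ is the number of items, $\sigma_i^2$ the variance of item $i$, $\sigma_X^2$ the variance of the total score, and $\bar{r}$ the mean inter-item correlation. The second form is the standardized alpha, which assumes items are standardized to equal variance.

```latex
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^2}{\sigma_X^2}\right),
\qquad
\alpha_{\text{std}} = \frac{k\,\bar{r}}{1 + (k-1)\,\bar{r}}

% Worked example: two different scales that both reach roughly .80
k = 5,\ \bar{r} = 0.45:\quad
\alpha_{\text{std}} = \frac{5(0.45)}{1 + 4(0.45)} = \frac{2.25}{2.80} \approx 0.80

k = 20,\ \bar{r} = 0.17:\quad
\alpha_{\text{std}} = \frac{20(0.17)}{1 + 19(0.17)} = \frac{3.40}{4.23} \approx 0.80
```

The two worked cases give about the same alpha from quite different average inter-item correlations, which is one reason an alpha value is best interpreted alongside the number of items in the scale.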

The most important conceptual point is the relationship between reliability and validity. Reliability is a *necessary but not sufficient condition* for validity. A measure can be perfectly reliable (consistent, stable, precise) yet completely wrong. A thermometer that consistently reads 5°C too high is reliable; a personality questionnaire that consistently measures neuroticism when you think it's measuring extraversion is reliable. Neither is valid. In formula terms, the correlation between a measure and any external criterion (its validity coefficient) cannot exceed the square root of the measure's reliability: validity ≤ √reliability. An unreliable measure cannot be valid because random measurement error attenuates correlations with external criteria; a reliable measure might still be invalid because it is consistently measuring the wrong construct.
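The ceiling follows from the classical attenuation formula, sketched here in notation chosen for this example: $\rho_{XY}$ is the observed correlation between measure $X$ and criterion $Y$, $\rho_{T_X T_Y}$ the correlation between their true scores, and $\rho_{XX'}$, $\rho_{YY'}$ their reliabilities. It assumes the errors in $X$ and $Y$ are random and uncorrelated.

```latex
\rho_{XY} = \rho_{T_X T_Y}\,\sqrt{\rho_{XX'}\,\rho_{YY'}}

% Since |\rho_{T_X T_Y}| \le 1 and \rho_{YY'} \le 1:
|\rho_{XY}| \le \sqrt{\rho_{XX'}}

% Worked example: true-score correlation .60, reliabilities .70 and .80
\rho_{XY} = 0.60 \times \sqrt{0.70 \times 0.80} = 0.60 \times \sqrt{0.56} \approx 0.45

% Ceiling example: a measure with reliability .49
|\rho_{XY}| \le \sqrt{0.49} = 0.70
```

In the worked example, a true relationship of .60 shows up as roughly .45 in the observed data, so a study using these measures would underestimate the effect even though nothing is wrong with the underlying theory.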

When you encounter a published scale, look at how reliability was assessed and in what population. Reliability is not a fixed property of an instrument — it depends on the range and homogeneity of scores in the sample. A scale with excellent reliability in a diverse community sample may show much lower reliability in a clinically homogeneous group, simply because there is less true-score variance for the items to capture. Understanding this context-dependence is essential for evaluating whether a measure is fit for purpose in a new research setting.
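The dependence on sample variability can be made concrete with a small calculation. This is a sketch that assumes the instrument's error variance stays the same across samples while true-score variance differs; the variance values and the names `reliability`, `community`, and `clinic` are invented for illustration.

```python
def reliability(true_var, error_var):
    """Reliability as the share of observed variance that is true-score variance."""
    return true_var / (true_var + error_var)


# Assumption: the same instrument, so the same error variance in both samples.
error_var = 25.0

# Hypothetical true-score variances: wide spread in a diverse community sample,
# narrow spread in a clinically homogeneous group.
community = reliability(true_var=100.0, error_var=error_var)
clinic = reliability(true_var=25.0, error_var=error_var)

print(f"community sample reliability: {community:.2f}")
print(f"clinical sample reliability:  {clinic:.2f}")
```

The instrument's noise level is identical in both cases; only the amount of true variation among people differs, and that alone moves the reliability coefficient.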
