Measurement, Validity, and Reliability

A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Explainer

Measurement is where abstract concepts meet concrete data, and the gap between them is where most research goes wrong. In the social sciences, you rarely measure what you care about directly. You can't observe "intelligence," "social trust," or "political polarization" the way you can measure temperature or mass. Instead, you observe indicators: survey responses, behavioral counts, test scores, administrative records. Whether those indicators actually capture the concept you intend is the question of validity. This is one of the most consequential methodological issues in social science, because a finding that is statistically sound but invalid (measuring the wrong thing precisely) is worse than useless: it lends the appearance of rigor to a conclusion about the wrong concept.

The validity landscape has several distinct layers, and it's essential to keep them separate. Construct validity asks whether your measurement instrument captures the theoretical concept you intend. If you measure "anxiety" with a scale that actually tracks general negative affect, your construct validity is compromised: you're studying the wrong thing. Internal validity is about causal inference: can you attribute the relationship you found to the cause you claim, or could a confound explain it? This is primarily a design question (randomization, control groups, time ordering) rather than a measurement question per se. External validity asks whether findings generalize beyond your sample and context. For example, does a result from a lab study of U.S. undergraduates tell you anything about adults in general?

Reliability is a necessary but not sufficient condition for validity. A reliable measure produces consistent results across repeated applications: the same respondent gets a similar score on different days, or different raters agree on how to score the same observation. High reliability means your measure is precise. But precision and accuracy are different things: a scale that consistently reads 10 pounds too heavy is perfectly reliable, yet every reading it gives is wrong. The standard forms of reliability each capture a different dimension of measurement consistency: test-retest reliability (consistency over time), inter-rater reliability (consistency across coders), and internal consistency (items within a scale correlating with each other).
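As a rough illustration of the three forms of reliability above, here is a minimal Python sketch. The choice of statistic for each form is an assumption, not something the explainer specifies: Pearson correlation for test-retest, Cohen's kappa for inter-rater agreement, and Cronbach's alpha for internal consistency are common conventions. All scores are made-up illustrative data, and the function names are hypothetical.

```python
from statistics import mean, variance


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on categorical codes.

    Undefined (division by zero) when expected agreement is exactly 1.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    categories = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)


def cronbach_alpha(items):
    """items: one list of scores per scale item, same respondents in each."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))


# Test-retest: the same five respondents measured on two days (made-up data).
day1 = [12, 18, 25, 30, 22]
day2 = [13, 17, 26, 29, 23]
test_retest = pearson_r(day1, day2)

# Inter-rater: two coders labelling the same eight observations (made-up data).
coder_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
coder_b = ["pos", "pos", "neg", "pos", "pos", "neg", "pos", "neg"]
kappa = cohen_kappa(coder_a, coder_b)

# Internal consistency: three items answered by five respondents (made-up data).
scale_items = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 5, 5],
    [1, 3, 3, 4, 4],
]
alpha = cronbach_alpha(scale_items)

# Reliable but inaccurate: a scale that always reads 10 pounds too heavy.
true_weights = [150, 160, 170, 180]
readings = [w + 10 for w in true_weights]
scale_r = pearson_r(true_weights, readings)  # consistency with the truth
scale_bias = mean(r - t for r, t in zip(readings, true_weights))  # constant error

print(f"test-retest r = {test_retest:.2f}")
print(f"Cohen's kappa = {kappa:.2f}")
print(f"Cronbach's alpha = {alpha:.2f}")
print(f"biased scale: r = {scale_r:.2f}, mean error = {scale_bias} lb")
```

The last two lines of the data section restate the scale example in numbers: the readings track the true weights perfectly, so any consistency statistic looks excellent, while the constant 10-pound error goes undetected. A reliability coefficient on its own cannot reveal that kind of bias.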

Validating a construct is a cumulative, evidence-based process, not a single test. Content validity asks whether the items in your instrument cover the full conceptual domain (not just one corner of it). Convergent validity asks whether your measure correlates appropriately with other measures of the same construct. Discriminant validity asks whether it correlates only weakly with measures of theoretically distinct constructs: if your anxiety scale correlates just as highly with a depression scale as with other anxiety scales, it is not distinguishing anxiety from depression. Predictive validity asks whether your measure predicts outcomes it theoretically should. Taken together, these forms of evidence build a cumulative case that you are measuring what you claim to measure, which is the bedrock on which everything else in social science research stands.
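The convergent and discriminant checks described above come down to comparing correlations. The sketch below assumes Pearson correlation as the measure of association and uses made-up scores for six respondents. The scale names and the helper `validity_pattern` are hypothetical, and the comparison it makes is an informal illustration, not a formal validation procedure.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def validity_pattern(new_scale, same_construct, different_construct):
    """Return (convergent_r, discriminant_r) for a new scale.

    Evidence supports the new scale when the convergent correlation is
    clearly higher than the discriminant one.
    """
    convergent_r = pearson_r(new_scale, same_construct)
    discriminant_r = pearson_r(new_scale, different_construct)
    return convergent_r, discriminant_r


# Made-up scores for six respondents on three instruments.
new_anxiety = [10, 14, 18, 22, 26, 30]
established_anxiety = [11, 13, 19, 21, 27, 29]
depression = [15, 12, 20, 16, 22, 19]

convergent_r, discriminant_r = validity_pattern(
    new_anxiety, established_anxiety, depression
)

print(f"convergent r (vs. established anxiety scale) = {convergent_r:.2f}")
print(f"discriminant r (vs. depression scale) = {discriminant_r:.2f}")
print("pattern supports construct validity:", convergent_r > discriminant_r)
```

In this toy data the new scale tracks the established anxiety scale much more closely than it tracks the depression scale, which is the pattern you would hope to see. If the two correlations came out about equal, that would be the warning sign described above.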
