The standard error of measurement (SEM) quantifies the amount of error in an individual test score due to measurement imprecision, computed as SEM = SD × √(1 − reliability). It is used to construct confidence intervals around observed scores, giving a range that contains the person's true score with a specified confidence (e.g., 95%). Understanding SEM is essential for avoiding overinterpretation of small score differences.
Begin with the conceptual link between reliability and error variance. Practice computing SEM values for tests with different reliability coefficients, then construct and interpret confidence intervals around actual test scores. Explore how confidence intervals widen as reliability falls and measurement becomes less precise.
From your study of reliability in measurement, you know that no psychological test is perfectly consistent: every observed score contains some measurement error. The question is not whether error exists, but how large it is and what it means for interpretation. The standard error of measurement (SEM) gives you a direct, interpretable answer: it tells you, in the original score units, how far an individual's observed score typically deviates from their hypothetical true score (in classical test theory, the average score they would obtain across many independent administrations of the same test). Formally, it is the standard deviation of the error component of observed scores. A smaller SEM means more precise measurement; a larger SEM means the observed score is a noisier estimate of the true score.
The formula is compact: SEM = SD × √(1 − reliability). Two things are immediately apparent. First, SEM is anchored in the standard deviation of the score distribution: a test whose scores are more spread out (larger SD) will have a larger SEM in absolute terms even at the same reliability level. Second, SEM is directly tied to reliability: a perfectly reliable test (reliability = 1.0) has SEM = 0, while a completely unreliable test (reliability = 0) has SEM equal to the full standard deviation of scores. Most real tests fall between these extremes. A test with SD = 15 and reliability = 0.90 has SEM = 15 × √(0.10) ≈ 4.7 points, so a measured IQ of 115 is compatible with true scores several points above or below that value.
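To make the formula concrete, here is a minimal Python sketch. The function name `sem` and the list of reliability values are illustrative choices, not part of any standard library; only the formula itself comes from the text above.

```python
import math


def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    if sd < 0:
        raise ValueError("sd must be non-negative")
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


# Illustrative reliability values for a scale with SD = 15.
for r in (1.0, 0.95, 0.90, 0.80, 0.70, 0.0):
    print(f"reliability = {r:.2f}  ->  SEM = {sem(15.0, r):.2f}")
```

Running the loop shows the two boundary cases described above (SEM of 0 at perfect reliability, SEM equal to the SD at zero reliability) and how quickly SEM grows as reliability drops.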
This range is made explicit with confidence intervals. Treating the SEM as the standard deviation of the error distribution (conventionally assumed to be approximately normal), you can compute the interval within which the true score likely falls. The 68% confidence interval spans one SEM above and below the observed score; the 95% interval spans approximately 1.96 × SEM on each side. For the IQ example above (SEM ≈ 4.74), the 95% confidence interval around a score of 115 is roughly 115 ± 9.3, or [106, 124]. This interval quantifies the uncertainty in the measurement and is indispensable for avoiding overinterpretation. Claiming that a score of 115 is definitively higher than a score of 112 would be unjustified given the measurement error in both scores: assuming independent errors, the standard error of the difference between two scores on this test is SEM × √2 ≈ 6.7 points, more than twice the 3-point gap.
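The interval calculation can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a definitive procedure: errors are treated as normal, the interval is centered on the observed score, and the helper names (`score_interval`, `difference_exceeds_error`) are hypothetical. The difference check additionally assumes both scores come from the same test with independent errors, so the standard error of the difference is SEM × √2.

```python
import math
from statistics import NormalDist


def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def score_interval(observed, sd, reliability, confidence=0.95):
    """Symmetric interval around an observed score, assuming normal errors."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.96 for 95%
    margin = z * sem(sd, reliability)
    return observed - margin, observed + margin


def difference_exceeds_error(score_a, score_b, sd, reliability, confidence=0.95):
    """True if the gap between two scores is larger than measurement error
    alone would plausibly produce (assumes equal SEM, independent errors)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    se_diff = math.sqrt(2.0) * sem(sd, reliability)
    return abs(score_a - score_b) > z * se_diff


low, high = score_interval(115, sd=15, reliability=0.90)
print(f"95% interval: [{low:.1f}, {high:.1f}]")
print("115 vs 112 distinguishable:",
      difference_exceeds_error(115, 112, sd=15, reliability=0.90))
```

Lowering `reliability` in the call to `score_interval` widens the interval, which is a quick way to see how much interpretive caution a less reliable test demands.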
A critical distinction worth reinforcing: the SEM is about individual score precision, not about sample means. The standard error of the mean (which you encountered in inferential statistics) equals SD / √n and quantifies uncertainty about a group average across replications of sampling. The standard error of measurement quantifies uncertainty about a single person's score across replications of testing. They share a name fragment, and the abbreviation SEM is sometimes used for both, but they answer different questions: "How precisely have we estimated the population mean?" (standard error of the mean) versus "How precisely have we measured this person?" (standard error of measurement). Conflating them leads to incorrect inferences about both individuals and groups.
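A short side-by-side computation may help keep the two quantities apart. The sketch below uses illustrative values (SD = 15, reliability = 0.90, and a few sample sizes) and hypothetical function names; the point is only that one quantity shrinks with sample size while the other does not depend on it.

```python
import math


def standard_error_of_mean(sd, n):
    """Uncertainty about a group mean: SD / sqrt(n)."""
    return sd / math.sqrt(n)


def standard_error_of_measurement(sd, reliability):
    """Uncertainty about one person's score: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


sd, reliability = 15.0, 0.90  # illustrative values
for n in (25, 100, 400):
    se_mean = standard_error_of_mean(sd, n)
    se_meas = standard_error_of_measurement(sd, reliability)
    print(f"n = {n:3d}  SE of mean = {se_mean:.2f}  SE of measurement = {se_meas:.2f}")
```

Testing more people sharpens the estimate of the group mean but does nothing for the precision of any one person's score; only a more reliable instrument does that.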
One important refinement: classical test theory assumes the SEM is constant across the full score range, but this is an approximation. In reality—and especially in IRT-based measurement—precision varies by score level. A test calibrated to measure average ability will be more precise near the middle of the score distribution and less precise at the extremes, where fewer items are targeting examinees' ability level. When interpreting scores at the tails of the distribution, a wider uncertainty range may be warranted even if the reported reliability is high. This is one reason modern adaptive tests and IRT-based systems compute conditional standard errors of measurement that vary across the ability continuum rather than applying a single SEM to all scores.
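One way to see why precision varies with score level is to compute a conditional standard error under a two-parameter logistic (2PL) IRT model, where the standard error of the ability estimate is 1 / √I(θ) and I(θ) is the test information. The sketch below is only an illustration: the item parameters in `ITEMS` are made up, with difficulties clustered around average ability, and the function names are hypothetical.

```python
import math

# Hypothetical 2PL item parameters (discrimination a, difficulty b),
# deliberately clustered around average ability (theta = 0).
ITEMS = [(1.2, -1.0), (1.0, -0.5), (1.5, 0.0), (1.3, 0.0), (1.0, 0.5), (1.2, 1.0)]


def prob_correct(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def total_information(theta, items=ITEMS):
    """Test information: sum over items of a^2 * P * (1 - P)."""
    total = 0.0
    for a, b in items:
        p = prob_correct(theta, a, b)
        total += a * a * p * (1.0 - p)
    return total


def conditional_sem(theta, items=ITEMS):
    """Conditional standard error of the ability estimate: 1 / sqrt(information)."""
    return 1.0 / math.sqrt(total_information(theta, items))


for theta in (-3.0, -1.5, 0.0, 1.5, 3.0):
    print(f"theta = {theta:+.1f}  conditional SEM = {conditional_sem(theta):.2f}")
```

With items targeted at the middle of the ability range, the conditional standard error is smallest near θ = 0 and grows toward the tails. Note that these values are on the θ (ability) scale, not the reported score scale.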