Questions: Test-Retest Reliability and Temporal Stability
5 questions to test your understanding
Question 1 Multiple Choice
A researcher develops a measure of 'current anxiety level' and finds a test-retest correlation of 0.25 over a four-week interval. They conclude the measure is unreliable. What is the most important alternative interpretation?
A. The measure lacks internal consistency among its items
B. The low correlation may reflect genuine fluctuation in anxiety over four weeks rather than measurement error, because anxiety is a state, not a stable trait
C. The retest interval was too short to detect true reliability
D. The sample was too homogeneous to produce a meaningful correlation
Answer: B
Test-retest reliability is an appropriate reliability index only when the construct being measured is theorized to be stable across the retest interval. State anxiety is, by definition, expected to fluctuate with circumstances: a change in anxiety over four weeks could reflect real life changes, treatment effects, or natural variation. Calling this 'unreliability' confounds measurement error with genuine construct change. For state measures, the better strategy is to use an interval short enough that real change is unlikely, or to estimate reliability from internal consistency (e.g., Cronbach's alpha) instead.
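The distinction between measurement error and genuine change can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not an analysis of any real scale: the function name `simulate_state_measure`, the true four-week stability of 0.3, and the error standard deviation of 0.3 are all hypothetical values chosen for illustration. A second "parallel form" given on the same occasion stands in, loosely, for a same-occasion reliability estimate such as internal consistency.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_state_measure(n=5000, true_stability=0.3, error_sd=0.3, seed=0):
    """Return (retest_r, same_occasion_r) for a simulated state measure.

    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    # True state at time 1, and a time-2 state that has genuinely changed.
    state1 = [rng.gauss(0, 1) for _ in range(n)]
    change_sd = math.sqrt(1 - true_stability ** 2)
    state2 = [true_stability * s + rng.gauss(0, change_sd) for s in state1]
    # Observed score = true state + small independent measurement error.
    obs1a = [s + rng.gauss(0, error_sd) for s in state1]
    obs1b = [s + rng.gauss(0, error_sd) for s in state1]  # parallel form, same occasion
    obs2 = [s + rng.gauss(0, error_sd) for s in state2]
    return pearson(obs1a, obs2), pearson(obs1a, obs1b)


retest_r, same_occasion_r = simulate_state_measure()
print(f"four-week retest r:  {retest_r:.2f}")
print(f"same-occasion r:     {same_occasion_r:.2f}")
```

Under these assumptions the retest correlation comes out low even though the instrument measures each occasion precisely, which is the pattern option B describes.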
Question 2 Multiple Choice
A personality scale administered to the same people six months apart yields a stability coefficient of 0.88. What can you confidently conclude from this result alone?
A. The scale measures the intended personality construct accurately (high validity)
B. The scale's items are highly intercorrelated (high internal consistency)
C. People's scores on this scale are highly stable across a six-month interval (high temporal stability)
D. The scale would show equally high stability over a six-year interval
Answer: C
A high stability coefficient tells you that scores are consistent over the interval tested, i.e., that the scale shows temporal stability. It says nothing about whether the test measures what it claims to measure (validity), whether the items cohere with each other (internal consistency), or whether stability generalizes to different intervals. A measure can be perfectly stable over six months while measuring the wrong construct entirely. Reliability, including test-retest reliability, is necessary but not sufficient for validity.
Question 3 True / False
Very short retest intervals (hours or days) can artificially inflate stability coefficients because participants remember their previous responses and anchor to them.
T. True
F. False
Answer: True
This carry-over effect is a major threat to the interpretation of test-retest coefficients. When participants recall how they responded previously, they tend to give similar answers, not because the construct is stable but because they remember. The resulting correlations are inflated and overestimate true temporal stability. The solution is to use intervals long enough for specific item responses to fade from memory, but not so long that genuine construct change becomes the dominant source of variance.
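One way to picture carry-over is as correlated measurement error: part of the time-1 error is reproduced at time 2 because participants repeat remembered answers. The sketch below assumes exactly that model. The function name `simulate_retest`, the `memory_weight` parameter, and the choice of equal trait and error variance (a true reliability of 0.5) are hypothetical, picked only to make the inflation visible.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_retest(n=5000, memory_weight=0.0, seed=1):
    """Retest correlation when a share of time-1 error is remembered at time 2.

    memory_weight = 0 means errors are independent (no carry-over);
    values near 1 mean participants largely repeat their earlier answers.
    """
    rng = random.Random(seed)
    trait = [rng.gauss(0, 1) for _ in range(n)]   # perfectly stable true score
    err1 = [rng.gauss(0, 1) for _ in range(n)]    # time-1 measurement error
    fresh_sd = math.sqrt(1 - memory_weight ** 2)  # keeps error variance at 1
    err2 = [memory_weight * e + rng.gauss(0, fresh_sd) for e in err1]
    obs1 = [t + e for t, e in zip(trait, err1)]
    obs2 = [t + e for t, e in zip(trait, err2)]
    return pearson(obs1, obs2)


print(f"no carry-over (long interval):     r = {simulate_retest(memory_weight=0.0):.2f}")
print(f"strong carry-over (short interval): r = {simulate_retest(memory_weight=0.8):.2f}")
```

In this model the trait itself never changes, so the gap between the two coefficients is produced by memory alone.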
Question 4 True / False
Demonstrating high test-retest reliability over six months is sufficient evidence that a psychological measure is both reliable and valid.
T. True
F. False
Answer: False
High test-retest reliability shows only that the measure is stable over time, that is, that it is consistently measuring *something*. It provides no evidence about whether that something is the intended construct. A scale claiming to measure extroversion might correlate 0.90 with itself over six months while correlating 0.10 with actual extroverted behavior. Reliability is a prerequisite for validity, not a proxy for it. Validity requires additional evidence (convergent, discriminant, and criterion-related) beyond stability alone.
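The extroversion example can be mimicked with simulated data. The sketch below assumes a hypothetical scale whose scores are driven mostly by a stable nuisance factor (labelled `response_style` here) and only weakly by extroversion. The function name, the weights, and the noise levels are illustrative assumptions, not estimates from any real instrument.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_stable_but_invalid(n=5000, seed=2):
    """Return (retest_r, criterion_r) for a stable scale that misses its construct."""
    rng = random.Random(seed)
    extroversion = [rng.gauss(0, 1) for _ in range(n)]    # intended construct
    response_style = [rng.gauss(0, 1) for _ in range(n)]  # stable nuisance factor

    def administer():
        # Scores mostly reflect response style, barely reflect extroversion.
        return [s + 0.1 * e + rng.gauss(0, 0.3)
                for s, e in zip(response_style, extroversion)]

    score_t1 = administer()
    score_t2 = administer()  # six months later; both factors unchanged
    # Criterion: observed extroverted behavior, a noisy indicator of the construct.
    behavior = [e + rng.gauss(0, 0.5) for e in extroversion]
    return pearson(score_t1, score_t2), pearson(score_t1, behavior)


retest_r, criterion_r = simulate_stable_but_invalid()
print(f"test-retest r:          {retest_r:.2f}")
print(f"criterion (behavior) r: {criterion_r:.2f}")
```

The scale is highly stable because response style is stable, yet it tells you almost nothing about extroverted behavior.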
Question 5 Short Answer
Why does the length of the retest interval fundamentally affect the interpretation of a stability coefficient, and what principle should guide the choice of interval for a measure of a stable personality trait?
Model answer: The stability coefficient conflates measurement error with genuine construct change, and the relative contribution of each depends heavily on the interval. Over a very short interval, memory effects inflate the coefficient while genuine change is minimal. Over a very long interval, real developmental or environmental change deflates it, even if the measurement itself is perfectly reliable. For a stable personality trait, the interval should be long enough that carry-over memory effects are negligible but short enough that true developmental change in the trait is not expected. For personality traits theorized to be stable across the adult lifespan, intervals of 6 months to 2 years are typical: long enough for memory to fade, short enough that life-stage change is modest for most individuals.
The key insight is that a 'stability coefficient of 0.85' communicates entirely different information depending on whether the interval is two weeks or two years. Researchers should always report the interval and justify its selection relative to the theoretical rate of change in the construct. Without this, the coefficient cannot be meaningfully interpreted.
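The dependence on interval can be made concrete with a simple model. The sketch below assumes classical test theory with independent errors and true scores that change at a constant rate per month (a first-order autoregressive model), so the expected coefficient is reliability times monthly stability raised to the number of months. The function name, the reliability of 0.90, and the monthly stability values are hypothetical, and memory effects are deliberately left out.

```python
def expected_retest_r(reliability, monthly_stability, months):
    """Expected observed retest correlation under the assumptions above.

    reliability       -- share of observed variance that is true-score variance
    monthly_stability -- correlation of true scores one month apart (assumed constant)
    months            -- length of the retest interval
    """
    return reliability * monthly_stability ** months


# Two weeks, six months, and two years, expressed in months.
intervals = (0.5, 6, 24)
for label, phi in (("trait-like construct", 0.99), ("state-like construct", 0.70)):
    coefficients = [round(expected_retest_r(0.90, phi, m), 2) for m in intervals]
    print(label, dict(zip(intervals, coefficients)))
```

The same instrument, with the same measurement precision, yields very different coefficients depending on the interval and on how quickly the construct itself changes, which is why a coefficient reported without its interval is hard to interpret.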