Validity is not a test property but a quality of inferences drawn from scores in a specific context. Validity evidence comes from five sources: content, response processes, internal structure, relations to external variables, and consequences. Effective interpretation requires designing validation studies that gather evidence relevant to intended uses and interpretations.
From your work on validity evidence frameworks, you know the conceptual pivot that the *Standards for Educational and Psychological Testing* (1999/2014) codified: validity is not a fixed property of a test, but a judgment about the appropriateness of specific inferences drawn from test scores in specific contexts for specific purposes. A test of reading comprehension may yield valid inferences about reading ability while yielding invalid inferences when used to make employment decisions in a job that does not require reading. The test did not change; the inference changed. This reframing dissolves the older tripartite distinction (content validity, criterion validity, construct validity) and replaces it with a unified concept: an argument that evidence supports, or fails to support, a score interpretation.
The five sources of validity evidence define the terrain of that argument. Content evidence asks whether the test items adequately represent the domain of interest, and is established through expert review, content mapping, and alignment studies. Response process evidence asks whether examinees are actually doing what the test intends, and is established through think-aloud protocols, eye-tracking, or cognitive interviewing. A math test may be measuring reading ability instead of mathematical reasoning if the items are verbally dense; response process data can reveal this. Internal structure evidence asks whether the item relationships within the test match the hypothesized structure of the construct, and is established through factor analysis and IRT model fit. Evidence from relations to external variables asks whether scores correlate with other measures as theory predicts: strongly with measures of the same construct (convergent evidence) and weakly with measures of different constructs (discriminant evidence). Consequential evidence asks whether the use of test scores produces the intended outcomes and whether it also produces unintended ones.
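The convergent and discriminant pattern described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the scores are synthetic, and the names `pearson_r`, `simulate_scores`, `new_test`, `same_construct`, and `other_construct` are hypothetical and do not come from any real instrument or study.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def simulate_scores(n=500, seed=0):
    """Generate synthetic scores (an assumption made purely for illustration)."""
    rng = random.Random(seed)
    new_test, same_construct, other_construct = [], [], []
    for _ in range(n):
        trait = rng.gauss(0, 1)      # latent construct the new test targets
        unrelated = rng.gauss(0, 1)  # a theoretically distinct construct
        new_test.append(trait + rng.gauss(0, 0.5))
        same_construct.append(trait + rng.gauss(0, 0.5))
        other_construct.append(unrelated + rng.gauss(0, 0.5))
    return new_test, same_construct, other_construct


new_test, same_construct, other_construct = simulate_scores()

# Convergent evidence: correlation with an established measure of the same construct.
convergent_r = pearson_r(new_test, same_construct)
# Discriminant evidence: correlation with a measure of a different construct.
discriminant_r = pearson_r(new_test, other_construct)

print(f"convergent r   = {convergent_r:.2f}")
print(f"discriminant r = {discriminant_r:.2f}")
```

In a real study the three score vectors would come from the same examinees taking all three measures, and theory, not a fixed cutoff, would determine how large or small each correlation should be.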
Designing a validation study means deciding which sources of evidence are most relevant to the intended interpretation and then building a research program to gather them. The five sources do not need equal attention for every test: a straightforward knowledge test used for licensure may require principally content evidence and criterion-related evidence, a form of relations to external variables (can licensed practitioners actually do the job?), while a novel measure of an abstract psychological construct like "grit" requires heavy investment in internal structure and discriminant evidence. The interpretive argument framework (Kane, 2006) makes this structure explicit: the test developer states the chain of inferences from observed score to ultimate decision, treats each inference as a link, and specifies what evidence would strengthen or break each link.
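One way to make the chain-of-inferences idea concrete is to write the argument down as data and check which links still lack evidence. The sketch below is an illustration, not Kane's own formalism: the `Inference` class, the `unsupported_links` helper, and every claim and study name in the example are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Inference:
    """One link in the chain from observed score to decision (hypothetical structure)."""
    name: str
    claim: str
    evidence_needed: list[str]
    evidence_collected: list[str] = field(default_factory=list)

    def missing(self):
        """Evidence the argument calls for that has not yet been gathered."""
        return [e for e in self.evidence_needed if e not in self.evidence_collected]


def unsupported_links(argument):
    """Names of links that still have at least one piece of evidence missing."""
    return [link.name for link in argument if link.missing()]


# An invented argument for a clinical licensure test, for illustration only.
argument = [
    Inference("scoring", "Responses are scored accurately and consistently.",
              ["rater agreement study"], ["rater agreement study"]),
    Inference("generalization", "Observed scores generalize across forms and occasions.",
              ["reliability study"], ["reliability study"]),
    Inference("extrapolation", "Scores reflect performance in clinical practice.",
              ["criterion study", "content review"], ["content review"]),
    Inference("decision", "Cut-score decisions lead to the intended outcomes.",
              ["consequences study"]),
]

for link in argument:
    print(link.name, "missing:", link.missing())
```

Writing the argument out this way forces the developer to say, before data collection, what would count as support for each link; a link with an empty `evidence_collected` list is a gap in the validation plan, not a finding about the test.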
A common failure mode in test development is gathering validity evidence *after* widespread deployment, when negative findings are costly to act on. Best practice is to design the validation program before the test is used operationally, so that pilot data inform item refinement and the evidentiary argument at the same time. If the intended interpretation is that high scorers are more qualified for a clinical position, then criterion-related studies should be designed with the hiring outcome in mind from the outset, not added retroactively when someone questions the test's use. Validity is an ongoing process of accumulation, not a one-time certification, and each new population, context, or decision changes the evidentiary requirements.