Contemporary validity frameworks (APA/AERA/NCME Standards) organize evidence into five sources: test content, response processes, internal structure, relations to other variables, and consequences of testing. This unified view treats validity as an integrated evaluation of how well evidence supports the intended interpretations and uses of test scores.
Your earlier work on construct, criterion, and content validity introduced three concepts that were historically treated as distinct *types* of validity, as if a test could be "criterion valid" independently of whether it was "content valid." The modern framework, codified in the *Standards for Educational and Psychological Testing* (APA/AERA/NCME), rejects this fragmentation. Validity is now understood as a single, unified concept: the degree to which evidence and theory support the interpretation of test scores for a specific proposed use. The five sources of evidence are not separate validity types; they are different lines of evidence that together build or undermine the validity argument for a particular use.
Evidence from test content examines whether the items adequately represent the domain the test claims to measure. This is the conceptual heir to content validity: subject-matter experts judge whether the test covers the right content in the right proportions. But content coverage alone cannot establish validity; a history exam might represent the curriculum perfectly and still produce misleading scores if confusing item wording means examinees are responding to the phrasing rather than to the history. Evidence from response processes addresses this gap by examining whether examinees actually use the cognitive or behavioral processes the test intends to invoke. Think-aloud protocols, eye-tracking, and cognitive interviews can reveal whether a "math reasoning" item is solved through reasoning or through test-taking tricks. If examinees bypass the intended process, the score does not mean what you think it means.
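The expert-judgment step for content evidence can be made concrete. One widely used index, not named in this text, is Lawshe's content validity ratio, CVR = (n_e − N/2)/(N/2), where n_e of N panelists rate an item "essential"; it runs from −1 to +1. The sketch below computes it and also compares a test's topic mix against a blueprint's target proportions. The function names, topics, and vote counts are hypothetical illustrations, not a prescribed procedure.

```python
from collections import Counter


def content_validity_ratio(n_essential: int, n_panelists: int) -> float:
    """Lawshe's CVR: -1 if no panelist rates the item essential, +1 if all do."""
    if n_panelists <= 0:
        raise ValueError("need at least one panelist")
    if not 0 <= n_essential <= n_panelists:
        raise ValueError("n_essential must lie between 0 and n_panelists")
    half = n_panelists / 2
    return (n_essential - half) / half


def blueprint_gaps(item_topics: list[str], targets: dict[str, float]) -> dict[str, float]:
    """Observed share of items per topic minus the blueprint's target share (hypothetical helper)."""
    counts = Counter(item_topics)  # missing topics count as 0
    total = len(item_topics)
    return {topic: counts[topic] / total - share for topic, share in targets.items()}


# Hypothetical example: 10 experts rate three history items.
essential_votes = {"item_1": 9, "item_2": 5, "item_3": 2}
for item, votes in essential_votes.items():
    print(item, round(content_validity_ratio(votes, 10), 2))
# item_1 0.8, item_2 0.0, item_3 -0.6

# Hypothetical 10-item exam vs. a blueprint calling for 40% ancient, 60% modern history.
topics = ["ancient"] * 6 + ["modern"] * 4
gaps = blueprint_gaps(topics, {"ancient": 0.4, "modern": 0.6})
print({topic: round(gap, 2) for topic, gap in gaps.items()})
```

Items with low or negative CVR are candidates for revision or removal, and a positive gap flags an over-represented topic. Neither number says anything about how examinees actually respond to the items, which is the gap response-process evidence fills.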
Evidence from internal structure uses factor analysis and related methods (building on your measurement prerequisites) to evaluate whether the relationships among items and subscales match the theoretical model. If a test claims to measure three distinct abilities but all items load on a single factor, the three-score interpretation lacks structural support. Evidence from relations to other variables encompasses the convergent, discriminant, and criterion-related evidence you have studied separately: correlations with measures of theoretically related and unrelated constructs, and with the outcomes the test is supposed to predict. These external relationships are often the most direct empirical check on whether the score behaves like the intended construct, but in the unified view no single source settles the validity argument on its own.
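Both analyses in this paragraph can be sketched on simulated data. The example below assumes six hypothetical items presented as three two-item subscales but generated from a single latent ability, plus one hypothetical related measure and one unrelated measure. For internal structure it uses the eigenvalues of the inter-item correlation matrix with the Kaiser eigenvalue-greater-than-one rule, a crude screening heuristic standing in for a full (e.g., confirmatory) factor analysis. The loadings, sample size, and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_examinees = 2000

# Hypothetical data: six items written as three two-item "subscales",
# but every item is actually driven by ONE latent ability.
ability = rng.standard_normal(n_examinees)
loading = 0.7
noise_sd = np.sqrt(1 - loading**2)  # gives each item unit variance
items = loading * ability[:, None] + noise_sd * rng.standard_normal((n_examinees, 6))

# Internal structure: eigenvalues of the 6 x 6 inter-item correlation matrix.
# Kaiser's "eigenvalue > 1" rule is a rough retention heuristic, not a full factor analysis.
corr = np.corrcoef(items, rowvar=False)
eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # eigvalsh returns ascending order; reverse it
n_factors = int(np.sum(eigenvalues > 1))

# Relations to other variables: correlate the total score with a measure of a
# related construct (convergent) and with an unrelated one (discriminant).
total = items.sum(axis=1)
related = 0.6 * ability + 0.8 * rng.standard_normal(n_examinees)
unrelated = rng.standard_normal(n_examinees)
r_convergent = np.corrcoef(total, related)[0, 1]
r_discriminant = np.corrcoef(total, unrelated)[0, 1]

print("eigenvalues:", np.round(eigenvalues, 2))
print("factors retained (Kaiser):", n_factors)
print(f"convergent r = {r_convergent:.2f}, discriminant r = {r_discriminant:.2f}")
```

With a loading of 0.7, every pair of items correlates 0.49 in the population, so the population eigenvalues are 1 + 5(0.49) = 3.45 and 0.51 (five times). Only one exceeds 1, and the three-subscale interpretation gets no structural support. A real analysis would fit the hypothesized three-factor model and compare it with a one-factor model rather than rely on a retention rule.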
Evidence from consequences is the most controversial source. It asks whether the actual use of the test produces the intended outcomes and avoids harmful unintended ones. From your hypothesis testing background, you know that a statistical result is meaningful only relative to the question it was designed to answer; the same is true for test validity. A test that predicts job performance well overall but systematically underpredicts performance for one demographic group is not simply "valid" in the abstract: the consequences of using it for hiring decisions constitute validity evidence against its current application. This fifth source reflects validity theory's shift from asking "is this a valid test?" to asking "is this a valid use of this test with these people for this purpose?", a more demanding and context-dependent standard.
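The underprediction in this example is something you can check empirically. One standard approach, the regression (Cleary) model of predictive bias, which the text does not name, fits a single prediction line for everyone and asks whether one group's actual outcomes sit systematically above or below it. The sketch below uses simulated, hypothetical data in which group B performs 0.4 units better than its test scores alone would suggest.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n_per_group = 1000

# Hypothetical data: the same score-performance slope in both groups, but
# group B actually performs 0.4 units better than its test scores suggest.
score_a = rng.standard_normal(n_per_group)
score_b = rng.standard_normal(n_per_group)
perf_a = 0.5 * score_a + rng.normal(0.0, 0.8, n_per_group)
perf_b = 0.5 * score_b + 0.4 + rng.normal(0.0, 0.8, n_per_group)

# One pooled regression line, as if the test were used identically for everyone.
scores = np.concatenate([score_a, score_b])
perf = np.concatenate([perf_a, perf_b])
slope, intercept = np.polyfit(scores, perf, deg=1)  # coefficients, highest power first

# Mean residual per group: positive means the pooled line UNDER-predicts that group.
resid_a = perf_a - (intercept + slope * score_a)
resid_b = perf_b - (intercept + slope * score_b)
print(f"pooled line: perf = {intercept:.2f} + {slope:.2f} * score")
print(f"mean residual, group A: {resid_a.mean():+.2f}")
print(f"mean residual, group B: {resid_b.mean():+.2f}")
```

A positive mean residual for group B means the pooled line underpredicts that group, so hiring cutoffs based on it would disadvantage group B. Because least-squares residuals sum to zero and the groups are equal in size, the two group means mirror each other here. A full analysis would test group differences in intercepts and slopes formally, for example by adding a group indicator and a group × score interaction to the regression.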