A Receiver Operating Characteristic (ROC) curve plots sensitivity (true positive rate) against 1 - specificity (false positive rate) across all possible classification thresholds for a continuous diagnostic test or prediction model. Each point on the curve represents a different threshold, tracing the tradeoff between detecting true positives and generating false positives. The Area Under the ROC Curve (AUC) summarizes overall discriminative ability: AUC = 0.5 indicates no discrimination (equivalent to random guessing), and AUC = 1.0 indicates perfect discrimination. AUC has a concordance interpretation: it equals the probability that a randomly chosen diseased individual has a higher test value than a randomly chosen non-diseased individual. ROC analysis separates discrimination (can the model distinguish cases from non-cases?) from calibration (are the predicted probabilities accurate?).
From diagnostic test evaluation, you know that any test with a continuous measurement (blood glucose, tumor marker, risk score) requires a threshold to classify subjects as positive or negative. Assuming higher values indicate disease, lowering the threshold increases sensitivity (you catch more true cases) but decreases specificity (you also flag more healthy people). Raising the threshold does the opposite. The ROC curve displays this entire tradeoff at once by plotting sensitivity (y-axis) against 1 - specificity (x-axis) as the threshold sweeps across its full range.
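The threshold sweep described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name roc_points, the rule "positive when score >= threshold", and the eight scores and labels are all assumptions made up for the example.

```python
def roc_points(scores, labels):
    """Return (false positive rate, sensitivity) pairs, one per threshold.

    Assumptions for this sketch: higher scores indicate disease,
    labels are 1 (diseased) or 0 (non-diseased), and a subject is
    classified positive when score >= threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Start above every score (nobody is positive), then lower the
    # threshold through each distinct observed value.
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical test values for 4 diseased and 4 non-diseased subjects.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
for fpr, tpr in points:
    print(f"1 - specificity = {fpr:.2f}, sensitivity = {tpr:.2f}")
```

Each printed pair is one point on the empirical ROC curve. As the threshold drops, both coordinates can only stay the same or increase, which is the tradeoff the curve displays.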
The ROC curve always starts at (0, 0), the highest possible threshold where everything is classified as negative (zero sensitivity, perfect specificity), and ends at (1, 1), the lowest possible threshold where everything is classified as positive (perfect sensitivity, zero specificity). A perfect test has a curve that shoots straight up to (0, 1) and then across to (1, 1), hugging the upper-left corner. A useless test lies along the diagonal from (0, 0) to (1, 1): at every threshold it flags the same proportion of diseased and non-diseased subjects, so sensitivity and the false positive rate rise in lockstep and the test contains no information.
The AUC collapses the entire curve into a single number. It has an elegant probabilistic interpretation: AUC equals the probability that a randomly chosen diseased subject has a higher test value than a randomly chosen non-diseased subject, with tied pairs counted as one half. An AUC of 0.90 means that 90% of all possible diseased/non-diseased pairs are correctly ordered by the model. This makes AUC a natural measure of discrimination, that is, the model's ability to rank subjects by risk. Conventional benchmarks (though context-dependent) consider an AUC of 0.7-0.8 acceptable, 0.8-0.9 excellent, and above 0.9 outstanding.
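To make the concordance interpretation concrete, the sketch below computes the AUC two ways on a small made-up data set: by counting correctly ordered diseased/non-diseased pairs, and by the trapezoidal area under the empirical ROC curve. The function names auc_concordance and auc_trapezoid and the data are assumptions for illustration, and the sketch assumes higher scores indicate disease.

```python
def auc_concordance(scores, labels):
    """Share of diseased/non-diseased pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    total = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos) * len(neg))


def auc_trapezoid(scores, labels):
    """Area under the empirical ROC curve using the trapezoidal rule."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    area = 0.0
    prev_fpr, prev_tpr = 0.0, 0.0  # the curve starts at (0, 0)
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fpr, tpr = fp / n_neg, tp / n_pos
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2
        prev_fpr, prev_tpr = fpr, tpr
    return area


# Hypothetical test values for 4 diseased and 4 non-diseased subjects.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

auc_pairs = auc_concordance(scores, labels)
auc_area = auc_trapezoid(scores, labels)
print(auc_pairs, auc_area)  # both 0.8125: 13 of 16 pairs are correctly ordered
```

The two numbers agree because the area under the empirical curve is exactly the proportion of concordant pairs. The pairwise loop is quadratic in sample size, so it is suitable only for small illustrations like this one.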
However, AUC has limitations. It summarizes performance across all thresholds, including many that are clinically irrelevant. If you only care about high-sensitivity operating points (screening tests), the part of the ROC curve at low sensitivity is irrelevant but still contributes to the AUC. Two models with identical AUC can have very different performance at the threshold you would actually use. Furthermore, AUC measures discrimination but not calibration, that is, whether the predicted probabilities are accurate. A model that assigns probability 0.8 to everyone with disease and 0.6 to everyone without has perfect discrimination (AUC = 1.0), because every diseased subject outranks every non-diseased one, but terrible calibration, because nobody assigned a 60% risk actually has the disease and everybody assigned an 80% risk does. For clinical decisions based on absolute risk thresholds, both discrimination and calibration matter, and AUC alone is insufficient.
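The 0.8 versus 0.6 example can be checked numerically. The sketch below assumes a hypothetical cohort of 10 diseased and 90 non-diseased subjects (the cohort size and prevalence are invented for illustration), then compares the AUC with two simple calibration checks: the observed event rate within each predicted-probability group, and the mean predicted risk against the observed prevalence.

```python
# Hypothetical cohort: 10 diseased (label 1) and 90 non-diseased (label 0).
labels = [1] * 10 + [0] * 90
# The model from the text: 0.8 for every diseased subject, 0.6 for the rest.
preds = [0.8 if y == 1 else 0.6 for y in labels]

# Discrimination: share of diseased/non-diseased pairs ranked correctly.
pos = [p for p, y in zip(preds, labels) if y == 1]
neg = [p for p, y in zip(preds, labels) if y == 0]
concordant = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
auc = concordant / (len(pos) * len(neg))

# Calibration: observed event rate among subjects sharing a predicted risk.
observed = {}
for value in sorted(set(preds)):
    group = [y for p, y in zip(preds, labels) if p == value]
    observed[value] = sum(group) / len(group)

mean_predicted = sum(preds) / len(preds)
prevalence = sum(labels) / len(labels)

print("AUC:", auc)
for value, rate in observed.items():
    print(f"predicted {value:.1f} -> observed event rate {rate:.1f}")
print(f"mean predicted risk {mean_predicted:.2f} vs prevalence {prevalence:.2f}")
```

Under these assumptions the ranking is flawless, yet the group predicted at 0.6 has an observed event rate of 0 and the average predicted risk (about 0.62) is far above the 0.10 prevalence. A decision rule such as "treat if risk exceeds 50%" would treat everyone.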