Questions: Confusion Matrix and Classification Metrics
5 questions to test your understanding
Question 1 Multiple Choice
A disease affects 1% of a population. A diagnostic test achieves 99% accuracy by always predicting 'healthy' for every patient. What is this test's recall for detecting the disease?
A. 99%, matching its overall accuracy
B. 1%, equal to the disease prevalence
C. 0%, since it correctly identifies zero sick patients
D. 100%, since it correctly identifies all healthy patients as healthy
Recall = TP / (TP + FN). A test that always predicts 'healthy' has TP = 0 (it never correctly identifies a sick patient) and FN = every sick patient. Therefore recall = 0/(0 + all_sick) = 0%. The 99% accuracy comes entirely from correctly labeling the 99% who are healthy — the test is clinically useless for its actual purpose. This is the central lesson of the confusion matrix: overall accuracy is a dangerously misleading metric when class distributions are imbalanced. Option D describes specificity (true negative rate), not recall.
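The arithmetic can be checked with a minimal sketch. The population size of 10,000 and the variable names are assumptions for illustration; the question gives only the 1% prevalence.

```python
# Hypothetical population size; only the 1% prevalence comes from the question.
population = 10_000
sick = population // 100       # 1% prevalence -> 100 sick patients
healthy = population - sick    # 9,900 healthy patients

# The test predicts 'healthy' for everyone, so it never produces a positive.
tp, fp = 0, 0
fn, tn = sick, healthy

accuracy = (tp + tn) / population
recall = tp / (tp + fn)          # fraction of sick patients detected
specificity = tn / (tn + fp)     # fraction of healthy patients cleared

print(f"accuracy={accuracy:.2f} recall={recall:.2f} specificity={specificity:.2f}")
# prints: accuracy=0.99 recall=0.00 specificity=1.00
```

The 100% figure in option D shows up here as specificity, while recall stays at zero.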
Question 2 Multiple Choice
A spam filter is evaluated on 9,200 emails: 600 spam correctly caught (TP), 400 ham misclassified as spam (FP), 8,000 ham correctly passed (TN), 200 spam that slipped through (FN). What is the filter's recall?
A. 0.60, computed as TP / (TP + FP)
B. 0.75, computed as TP / (TP + FN)
C. 0.93, computed as (TP + TN) / total
D. 0.95, computed as TN / (TN + FP)
Recall = TP / (TP + FN) = 600 / (600 + 200) = 600 / 800 = 0.75. Recall answers: 'Of all actual spam, what fraction did the filter catch?' Option A is precision (TP / (TP + FP) = 600 / 1000 = 0.60), which answers 'Of everything flagged as spam, how much really was?' Option C is overall accuracy = 8600 / 9200 ≈ 0.93, and option D is specificity = 8000 / 8400 ≈ 0.95. Notice that precision (0.60) and recall (0.75) diverge: making the filter more aggressive would typically increase recall but lower precision, illustrating the tradeoff between the two.
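All four answer options can be reproduced from the same counts. The sketch below uses a hypothetical helper, `classification_metrics`, and assumes every denominator is non-zero, which holds for these numbers.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return common metrics derived from the four confusion-matrix counts.

    Assumes every denominator is non-zero (true for the quiz numbers).
    """
    total = tp + fp + tn + fn
    return {
        "precision": tp / (tp + fp),     # of flagged emails, how many were spam
        "recall": tp / (tp + fn),        # of actual spam, how much was caught
        "accuracy": (tp + tn) / total,   # of all emails, how many were right
        "specificity": tn / (tn + fp),   # of actual ham, how much was passed
    }

# Counts taken from the question.
metrics = classification_metrics(tp=600, fp=400, tn=8000, fn=200)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
# prints:
# precision: 0.60
# recall: 0.75
# accuracy: 0.93
# specificity: 0.95
```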
Question 3 True / False
A classifier with 99% accuracy is necessarily better than one with 95% accuracy for a fraud detection task where only 1% of transactions are fraudulent.
T. True
F. False
Answer: False
A model that labels every transaction as 'not fraud' achieves 99% accuracy but catches zero fraud cases (0% recall). For a fraud detection system, the relevant metrics are precision and recall for the fraud class. A model with 95% overall accuracy that catches 80% of actual fraud has far greater practical value, despite its lower aggregate accuracy. This is the core lesson of the confusion matrix: class-level metrics should take precedence over overall accuracy whenever class distributions are imbalanced or error costs are asymmetric.
Question 4 True / False
Increasing a binary classifier's classification threshold (requiring higher confidence before predicting 'positive') generally increases precision while decreasing recall.
T. True
F. False
Answer: True
A higher threshold makes the model more conservative: it predicts positive only when very confident. Fewer borderline cases are incorrectly called positive (FP decreases), which generally increases precision (TP / (TP + FP)). But more actual positives fall below the stricter threshold and are missed (FN increases), which decreases recall (TP / (TP + FN)). This precision-recall tradeoff is inherent to any classifier with an adjustable threshold. The confusion matrix makes it quantifiable: adjusting the threshold shifts counts between TP, FP, TN, and FN, changing all derived metrics simultaneously.
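To see the shift in numbers, here is a small sketch that sweeps three thresholds over a toy dataset. The scores, labels, thresholds, and the helper `precision_recall_at` are all made up for illustration and do not come from any real model.

```python
def precision_recall_at(threshold, scores, labels):
    """Predict positive when score >= threshold, then compute precision and recall."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    # Guard against empty denominators (e.g. no positive predictions at all).
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall

# Hypothetical model scores (sorted high to low) and true labels (1 = positive).
scores = [0.95, 0.90, 0.85, 0.70, 0.65, 0.60, 0.55, 0.40, 0.35, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

results = {t: precision_recall_at(t, scores, labels) for t in (0.3, 0.5, 0.8)}
for t, (precision, recall) in results.items():
    print(f"threshold={t:.1f} precision={precision:.2f} recall={recall:.2f}")
# prints:
# threshold=0.3 precision=0.60 recall=1.00
# threshold=0.5 precision=0.71 recall=0.83
# threshold=0.8 precision=1.00 recall=0.50
```

On this toy data precision rises at every step. On real data recall can never rise as the threshold goes up, but precision can occasionally dip, which is why the statement says 'generally'.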
Question 5 Short Answer
Explain why 'accuracy' is a misleading metric for a fraud detection system where 99% of transactions are legitimate, and identify two metrics that would be more informative.
Model answer: Accuracy measures the fraction of all predictions that are correct, but when 99% of cases are legitimate, a model that always predicts 'legitimate' achieves 99% accuracy while catching zero fraud. More informative metrics are precision (of transactions flagged as fraud, what fraction are actually fraudulent, which measures how trustworthy an alert is) and recall (of all actual fraud cases, what fraction was detected, which measures detection capability). These class-specific metrics expose what accuracy hides: whether the model actually works for its purpose.
The confusion matrix is most useful in exactly this situation: when different error types carry different costs. Missing fraud (a false negative) typically costs far more than a false alarm (a false positive), yet accuracy weights both errors equally. In imbalanced settings, the F1-score (the harmonic mean of precision and recall) is also more meaningful than accuracy. The general principle: choose metrics based on which errors matter most in the application context, not on which metric is easiest to compute.
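As a rough illustration, the sketch below compares an 'always legitimate' model with a model that reaches 95% accuracy and 80% recall. The 10,000-transaction total, the resulting counts, and the per-error costs are assumptions chosen for the example, and `f1_score` and `total_cost` are hypothetical helpers.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; taken as 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def total_cost(fp, fn, cost_fp, cost_fn):
    """Weight each error type by its own cost instead of counting errors equally."""
    return fp * cost_fp + fn * cost_fn

# Assumed setting: 10,000 transactions, 1% (100) fraudulent.
always_legit = {"tp": 0, "fp": 0, "fn": 100, "tn": 9900}   # 99% accuracy
detector = {"tp": 80, "fp": 480, "fn": 20, "tn": 9420}     # 95% accuracy, 80% recall

# Hypothetical costs: a missed fraud is 100x as expensive as a false alarm.
COST_FP, COST_FN = 5, 500

for name, m in (("always_legit", always_legit), ("detector", detector)):
    accuracy = (m["tp"] + m["tn"]) / sum(m.values())
    f1 = f1_score(m["tp"], m["fp"], m["fn"])
    cost = total_cost(m["fp"], m["fn"], COST_FP, COST_FN)
    print(f"{name}: accuracy={accuracy:.2f} f1={f1:.2f} cost={cost}")
# prints:
# always_legit: accuracy=0.99 f1=0.00 cost=50000
# detector: accuracy=0.95 f1=0.24 cost=12400
```

Under these assumed costs the lower-accuracy detector is roughly four times cheaper, and F1 separates the two models where accuracy ranks them the wrong way round.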