A recidivism prediction model is well-calibrated across racial groups — among defendants who receive a 70% risk score, roughly 70% actually reoffend, regardless of race. However, the model's false positive rate is higher for Black defendants than for White defendants. Which of the following is true?
A. The model satisfies both calibration and equalized odds
B. The model satisfies calibration but violates equalized odds
C. The model violates calibration because error rates differ across groups
D. The model must be retrained: both calibration and equalized odds can always be satisfied simultaneously
Option B is correct. Calibration requires that predicted probabilities match actual outcome rates within every group, which is satisfied here. Equalized odds requires that both false positive and false negative rates are equal across groups, so the unequal false positive rates violate it. Option C is wrong because calibration is a statement about scores, not about error rates. Option D is wrong: the Chouldechova-Kleinberg impossibility theorem proves that when base rates differ across groups, an imperfect classifier cannot simultaneously satisfy calibration and equal error rates. This is the real-world tension exposed in the 2016 ProPublica/Northpointe COMPAS debate.
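The scenario in this question can be reproduced with made-up numbers. The sketch below is illustrative only: the confusion-matrix counts and the function name `group_metrics` are hypothetical, and it uses positive predictive value (PPV) for a binary "high risk" flag as a coarse stand-in for calibration of a full risk score.

```python
def group_metrics(tp, fp, fn, tn):
    """Return base rate, PPV, FPR and FNR from confusion-matrix counts."""
    return {
        "base_rate": (tp + fn) / (tp + fp + fn + tn),
        "ppv": tp / (tp + fp),  # of those flagged, share who reoffend
        "fpr": fp / (fp + tn),  # of non-reoffenders, share wrongly flagged
        "fnr": fn / (fn + tp),  # of reoffenders, share missed
    }

# Hypothetical counts, 1000 defendants per group (not real COMPAS data).
group_a = group_metrics(tp=350, fp=150, fn=150, tn=350)  # base rate 0.5
group_b = group_metrics(tp=140, fp=60, fn=60, tn=740)    # base rate 0.2

for name, m in (("A", group_a), ("B", group_b)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

With these counts, PPV is 0.7 and FNR is 0.3 in both groups, while FPR is 0.3 for group A and 0.075 for group B: the flag means the same thing in each group, yet equalized odds fails.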
Question 2 Multiple Choice
Why does demographic parity have a fundamental limitation as a fairness criterion for a medical diagnosis model?
A. Demographic parity is too computationally expensive to enforce for medical models
B. It requires equal positive prediction rates across groups, which would force the model to either over-predict for low-prevalence groups or under-predict for high-prevalence groups
C. Medical models are exempt from fairness requirements under HIPAA regulations
D. Demographic parity only measures false positives, ignoring the impact of false negatives on patient care
Option B is correct. If disease rates genuinely differ between groups (e.g., a condition is more prevalent in older adults), enforcing equal positive prediction rates forces the model to make incorrect predictions in at least one group. Either it over-predicts for the low-prevalence group (unnecessary interventions) or under-predicts for the high-prevalence group (missed diagnoses). Demographic parity ignores whether predictions are *correct*; it only measures how often the model predicts positive. A model can satisfy demographic parity while being less accurate for both groups than a model that ignores group membership entirely.
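The arithmetic behind this can be sketched in a few lines. The prevalences, cohort sizes, and the helper name `min_errors_under_parity` below are assumptions chosen for illustration; the function gives the fewest errors that even an otherwise perfect classifier must make once the number of positive predictions is fixed.

```python
def min_errors_under_parity(n_sick, n_flagged):
    """Fewest (false positives, false negatives) possible when exactly
    n_flagged patients must be flagged and n_sick are truly sick."""
    min_false_pos = max(0, n_flagged - n_sick)  # flags beyond the truly sick
    min_false_neg = max(0, n_sick - n_flagged)  # sick patients left unflagged
    return min_false_pos, min_false_neg

# Hypothetical cohorts of 1000 patients each: 300 sick (older) vs 50 sick (younger).
# Demographic parity forces the same number of flags (here 175) in both cohorts.
older = min_errors_under_parity(n_sick=300, n_flagged=175)
younger = min_errors_under_parity(n_sick=50, n_flagged=175)

print("older   (min FP, min FN):", older)
print("younger (min FP, min FN):", younger)
```

Under these assumed numbers, the older cohort has at least 125 missed diagnoses and the younger cohort at least 125 unnecessary flags, whichever common rate between the two prevalences is chosen only shifts errors from one group to the other.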
Question 3 True / False
A machine learning model that satisfies demographic parity necessarily also satisfies equalized odds.
T. True
F. False
Answer: False
Demographic parity requires equal positive prediction rates across groups. Equalized odds requires equal true positive *and* false positive rates (i.e., the model makes correct and incorrect predictions at the same rates for each group). These are distinct definitions. A model can have equal prediction rates while having very different error structures — for example, one group's positives could all be true positives while another group's include many false positives. Satisfying one definition says nothing about the other.
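A tiny constructed example makes the gap between the two definitions concrete. The labels and predictions below are invented, and `rates` is a hypothetical helper; both groups are flagged at the same rate, but their true positive and false positive rates differ.

```python
def rates(y_true, y_pred):
    """Return (positive prediction rate, TPR, FPR) for 0/1 labels and predictions."""
    preds_for_pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    preds_for_neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    return (
        sum(y_pred) / len(y_pred),
        sum(preds_for_pos) / len(preds_for_pos),
        sum(preds_for_neg) / len(preds_for_neg),
    )

# Invented data: both groups have the same true labels (3 positives out of 10).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
pred_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # every flag is a true positive
pred_b = [1, 0, 0, 1, 1, 0, 0, 0, 0, 0]  # same number of flags, mostly wrong

rates_a = rates(y_true, pred_a)
rates_b = rates(y_true, pred_b)
print("group A:", rates_a)
print("group B:", rates_b)
```

Both groups have a positive prediction rate of 0.3, so demographic parity holds, but group A has TPR 1 and FPR 0 while group B has TPR 1/3 and FPR 2/7, so equalized odds does not.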
Question 4 True / False
When base rates of the target outcome differ between groups, it is mathematically impossible to simultaneously achieve calibration, equal false positive rates, and equal false negative rates.
T. True
F. False
Answer: True
This is the Chouldechova-Kleinberg impossibility theorem. If Group A has a higher base rate of the positive outcome than Group B, a calibrated classifier will assign higher predicted probabilities to Group A members on average, so both error rates cannot be equalized across groups while calibration holds. The only exception is a perfect predictor, which real systems do not achieve. In practice, one of the three properties must be sacrificed. This is not a failure of engineering but a mathematical constraint, which is why the choice of fairness metric must be a normative decision, not a technical one.
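One way to see where the constraint comes from is the identity used in Chouldechova's analysis of a binary classifier, with positive predictive value (PPV) standing in for calibration. The notation is introduced here for illustration: $p$ is a group's base rate and $N$ is the group's size.

```latex
\begin{align*}
\mathrm{TP} &= p\,(1-\mathrm{FNR})\,N, \qquad \mathrm{FP} = (1-p)\,\mathrm{FPR}\,N \\
\mathrm{PPV} &= \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}
  = \frac{p\,(1-\mathrm{FNR})}{p\,(1-\mathrm{FNR}) + (1-p)\,\mathrm{FPR}} \\
\Rightarrow\quad \mathrm{FPR} &= \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot(1-\mathrm{FNR})
\end{align*}
```

If two groups share the same PPV and the same FNR but have different base rates $p$, the right-hand side takes different values, so their false positive rates must differ. The only escapes are degenerate cases such as $\mathrm{PPV} = 1$, which corresponds to a classifier that never raises a false alarm.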
Question 5 Short Answer
Why must the choice of fairness metric depend on the application context rather than being defined universally for all machine learning systems?
Model answer: Different applications assign different costs to different types of errors. In criminal justice, a false positive (wrongly predicting reoffending) restricts the liberty of someone who would not have reoffended, which makes equal false positive rates a priority. In medical screening, a false negative (missing a disease) may be fatal, which makes equal true positive rates (equal opportunity) more important. In lending, calibration may take priority, and may even be legally required, so that a given score reflects the same default risk in every group. Because the Chouldechova-Kleinberg impossibility theorem shows these definitions cannot all be satisfied at once when base rates differ, the choice of which fairness property to prioritize is an ethical judgment about which type of error is more harmful, a question that cannot be answered by mathematics alone.
This is the central practical lesson of fairness in ML: there is no 'default' fairness criterion. Each definition encodes a value judgment about what counts as fair, and different stakeholders in different domains may reasonably disagree. Pre-processing, in-processing, and post-processing interventions all optimize for whichever metric the designer selects. A system that appears 'fair' by one definition may appear systematically discriminatory by another — which is why transparency about which fairness metric was chosen, and why, is essential.