What is the key difference between correlation and causation, and why does standard ML (which learns correlations) fail to capture causation?
Model answer: Correlation describes association: P(Y|X) is high if X and Y co-occur frequently. Causation describes intervention: P(Y|do(X)) is the probability of Y if we intervene to set X to a value. The difference emerges when confounders (variables that influence both X and Y) exist. For example, ice cream sales and drowning deaths are correlated (both increase in summer), but neither causes the other; summer weather is a confounder. Standard ML learns correlations from data, but confounders break the causal interpretation: cutting ice cream sales would not prevent drownings. Causal inference requires either randomization or additional assumptions (such as a causal graph specifying the confounders), together with specialized methods to estimate P(Y|do(X)) from observational data.
The correlation-causation distinction is fundamental. ML practitioners must recognize when they are estimating correlation vs. causation and choose methods accordingly. Causal inference is necessary for policy/treatment decisions where we care about intervention effects, not just prediction.
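The ice cream example can be made concrete with a small simulation. This is a minimal sketch under invented assumptions: the probabilities below, the function name simulate, and the do_x parameter are all hypothetical and chosen only for illustration. Summer drives both ice cream sales (X) and drownings (Y), and X has no effect on Y.

```python
import random

def simulate(n=200_000, do_x=None, seed=0):
    """Return the drowning rate P(Y=1) within each value of X.

    All probabilities are made up for illustration. If do_x is given,
    X is set by intervention instead of being driven by the season.
    """
    rng = random.Random(seed)
    counts = {0: [0, 0], 1: [0, 0]}  # x -> [number of units, number with y == 1]
    for _ in range(n):
        summer = rng.random() < 0.5  # confounder
        if do_x is None:
            # Observational regime: sales depend on the season.
            x = int(rng.random() < (0.8 if summer else 0.2))
        else:
            # Interventional regime: we fix X regardless of the season.
            x = do_x
        # Drownings depend on the season only, never on X.
        y = int(rng.random() < (0.30 if summer else 0.05))
        counts[x][0] += 1
        counts[x][1] += y
    return {x: hits / total for x, (total, hits) in counts.items() if total}

obs = simulate()               # estimates P(Y=1 | X=x)
do_high = simulate(do_x=1)[1]  # estimates P(Y=1 | do(X=1))
do_low = simulate(do_x=0)[0]   # estimates P(Y=1 | do(X=0))
print("observational:", obs, "interventional:", do_high, do_low)
```

Under these assumed numbers the observational rates differ sharply between high and low sales days, while the two interventional rates coincide, because forcing X to a value leaves the season (and therefore Y) untouched.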
Question 2 Multiple Choice
In causal graphs, a confounder is a variable that influences both the treatment and outcome. How does confounding bias causal effect estimates from observational data?
A. Confounders have no effect on causal estimates; they are irrelevant to do-calculus
B. Confounders induce spurious correlation between treatment and outcome, biasing effect estimates if not controlled for
C. Confounders always increase the estimated effect size, never decrease it
D. Confounders are automatically handled by any regression model
Confounders create non-causal association between treatment and outcome. If a confounder C influences both T (treatment) and Y (outcome), then the observed association P(Y|T) is biased: T and Y are correlated partly because of C, not only because of a causal effect of T on Y. The bias can run in either direction, inflating, shrinking, or even reversing the apparent effect. For example, in health studies, age is a confounder: older patients are both more likely to receive treatment (physicians prescribe more for the elderly) and more likely to have adverse outcomes (age-related decline). Comparing outcomes between treated and untreated patients without controlling for age conflates the treatment effect with age effects. Causal methods (matching, stratification, inverse probability weighting) adjust for confounders to isolate the causal effect.
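The age example can be worked through with a small table of counts. This is a sketch on hypothetical numbers: the counts, the helper rate, and the variable names are invented for illustration, and age is assumed to be the only confounder. Older patients are treated more often and have worse outcomes, so the naive comparison makes a beneficial treatment look harmful.

```python
# Hypothetical counts: (age_group, treated) -> (n_patients, n_adverse_outcomes)
counts = {
    ("old", 1): (800, 240),
    ("old", 0): (200, 80),
    ("young", 1): (200, 10),
    ("young", 0): (800, 80),
}

def rate(treated, age=None):
    """Adverse-outcome rate for a treatment arm, optionally within one age group."""
    cells = [v for (a, t), v in counts.items()
             if t == treated and (age is None or a == age)]
    n = sum(c[0] for c in cells)
    adverse = sum(c[1] for c in cells)
    return adverse / n

# Naive comparison ignores age entirely.
naive_diff = rate(1) - rate(0)

# Stratified comparison: difference within each age group,
# averaged by each group's share of the whole population.
total = sum(n for n, _ in counts.values())
adjusted_diff = 0.0
for age in ("old", "young"):
    share = sum(n for (a, t), (n, k) in counts.items() if a == age) / total
    adjusted_diff += share * (rate(1, age) - rate(0, age))

print(f"naive: {naive_diff:+.3f}, age-adjusted: {adjusted_diff:+.3f}")
```

With these made-up counts the naive difference is positive while the age-adjusted difference is negative, which also illustrates why option C is wrong: confounding can reverse the sign of an estimate, not just enlarge it.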
Question 3 Multiple Choice
Pearl's do-calculus provides rules for computing interventional distributions P(Y|do(X)) from observational distributions P(Y|X). In what situation can you compute the causal effect from observational data alone?
B. When all confounders are measured and the causal graph is known, satisfying the 'backdoor criterion'; then causal effects can be estimated by conditioning on confounders
C. When X has no confounders; then P(Y|do(X)) = P(Y|X)
D. When sample size is large; large data is sufficient to infer causation
The backdoor criterion, developed by Pearl, specifies when causal effects are identifiable from observational data: if the causal graph is correctly specified and a set of measured variables blocks every backdoor path from treatment to outcome (and contains no descendants of the treatment), you can estimate the causal effect by conditioning on that set. For example, if age is the only confounder of the treatment-outcome relationship, stratifying by age removes confounding, and the treatment effect is estimable. If unmeasured confounders exist, backdoor adjustment cannot identify the causal effect no matter how much data you collect, even with a known graph; you need additional assumptions (instrumental variables, regression discontinuity) or experiments.
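When a measured set satisfies the backdoor criterion, the resulting estimand is usually written as the adjustment formula below. It is shown for a discrete adjustment set; the symbol Z for that set is notation introduced here for illustration, not taken from the question.

```latex
P(Y = y \mid do(X = x)) = \sum_{z} P(Y = y \mid X = x, Z = z)\, P(Z = z)
```

For comparison, the observational conditional is P(Y = y | X = x) = \sum_z P(Y = y | X = x, Z = z) P(Z = z | X = x). The two differ only in the weights: adjustment averages over the population distribution of Z rather than its distribution among units that happened to have X = x.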
Question 4 True / False
Inverse Probability Weighting (IPW) is a method for estimating causal effects from observational data. The weights are typically inverse propensity scores. Why reweight rather than just condition?
T. True
F. False
Answer: True
IPW and conditioning (stratification) both control for confounders but have different properties. Conditioning (stratifying or matching on confounders) is intuitive but leads to sparse strata when the confounders are high-dimensional. IPW reweights observations: each unit is weighted by the inverse of the probability of the treatment it actually received, so units with a low propensity for their observed treatment are upweighted, creating a pseudo-population in which treatment is independent of the measured confounders. IPW scales better to high-dimensional confounders because it only needs a model of the scalar propensity score, but it can be unstable when propensity scores are extreme (some units have a very low probability of their observed treatment). Doubly robust methods combine a propensity model with an outcome model and remain consistent if either one is correctly specified.
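The reweighting idea can be sketched on a small table of hypothetical counts. Everything here is an assumption made for illustration: the counts, the helper names propensity and ipw_mean, and the premise that age group is the only confounder. The estimator used is the plain (Horvitz-Thompson style) IPW average, with the propensity score estimated as the treated fraction within each age group.

```python
# Hypothetical counts: (age_group, treated) -> (n_patients, n_adverse_outcomes)
counts = {
    ("old", 1): (800, 240),
    ("old", 0): (200, 80),
    ("young", 1): (200, 10),
    ("young", 0): (800, 80),
}
total = sum(n for n, _ in counts.values())

def propensity(age):
    """Estimated P(treated = 1 | age): the treated fraction within the age group."""
    n_treated = counts[(age, 1)][0]
    n_all = counts[(age, 1)][0] + counts[(age, 0)][0]
    return n_treated / n_all

def ipw_mean(treated):
    """IPW estimate of the adverse-outcome rate if everyone received `treated`."""
    weighted_adverse = 0.0
    for (age, t), (n, adverse) in counts.items():
        if t != treated:
            continue
        # Probability of the treatment this cell actually received.
        p_observed = propensity(age) if treated == 1 else 1 - propensity(age)
        # Every adverse outcome in the cell carries weight 1 / p_observed.
        weighted_adverse += adverse / p_observed
    return weighted_adverse / total

ate_ipw = ipw_mean(1) - ipw_mean(0)
print(f"IPW estimate of the treatment effect: {ate_ipw:+.3f}")
```

On these counts the young treated patients (propensity 0.2) receive weight 5, which is what rebalances the pseudo-population. A propensity near 0 would make that weight explode, which is the instability the explanation describes.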