A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Causal Inference Methods in Biostatistics

Core Idea

Causal inference in biostatistics formalizes the question "does X cause Y?" using the potential outcomes framework (Rubin causal model): each subject has a potential outcome under treatment, Y(1), and under control, Y(0), but only one is ever observed. This is the fundamental problem of causal inference. The average treatment effect (ATE) is E[Y(1) - Y(0)]. In randomized trials, random assignment makes the treated and control groups comparable, so the observed group means are unbiased estimates of E[Y(1)] and E[Y(0)]. In observational studies, confounding (common causes of treatment and outcome) prevents direct causal interpretation of an association. Causal inference methods such as propensity scores, instrumental variables, difference-in-differences, and regression discontinuity each address confounding under different assumptions. Directed acyclic graphs (DAGs) provide a visual language for encoding causal assumptions and identifying what must be adjusted for to estimate causal effects.

Explainer

The goal of causal inference is to determine whether a treatment or exposure causes a change in an outcome, not merely whether the two are associated. From study design, you know that randomized experiments provide the strongest evidence for causation. The potential outcomes framework explains why: each subject has two potential outcomes, Y(1) under treatment and Y(0) under control. The causal effect for that individual is Y(1) - Y(0). The "fundamental problem of causal inference" is that we observe only one of these: a person either receives the treatment or does not, never both.
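The bookkeeping behind potential outcomes can be sketched in a few lines of Python. This is only an illustration: the `Subject` class and the four subjects are hypothetical, invented so that both Y(1) and Y(0) are visible, which never happens with real data.

```python
from dataclasses import dataclass


@dataclass
class Subject:
    y1: float      # potential outcome under treatment, Y(1)
    y0: float      # potential outcome under control, Y(0)
    treated: bool  # which arm the subject actually ended up in

    def observed(self) -> float:
        # Only the potential outcome matching the received arm is observed.
        return self.y1 if self.treated else self.y0

    def individual_effect(self) -> float:
        # Computable here only because the data are made up.
        return self.y1 - self.y0


# Hypothetical toy data, not taken from any study.
subjects = [
    Subject(y1=7.0, y0=5.0, treated=True),
    Subject(y1=6.0, y0=6.0, treated=False),
    Subject(y1=4.0, y0=1.0, treated=True),
    Subject(y1=9.0, y0=8.0, treated=False),
]

# Average of the individual effects Y(1) - Y(0).
true_ate = sum(s.individual_effect() for s in subjects) / len(subjects)

# What an analyst could actually compute: a difference in observed group means.
treated_obs = [s.observed() for s in subjects if s.treated]
control_obs = [s.observed() for s in subjects if not s.treated]
naive_difference = sum(treated_obs) / len(treated_obs) - sum(control_obs) / len(control_obs)

print("true ATE:", true_ate)
print("difference in observed means:", naive_difference)
```

In this toy table the two numbers disagree, because the treated subjects happen to have lower outcomes under either arm. With only four hand-picked subjects that is by construction, but it shows why the observed contrast and the causal effect are different quantities.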

Randomization solves this at the population level: the treated group is a representative sample of the population's Y(1) values, and the control group is a representative sample of its Y(0) values. The difference in group means therefore estimates the Average Treatment Effect (ATE), E[Y(1)] - E[Y(0)], which equals E[Y(1) - Y(0)] by linearity of expectation. This works because random assignment makes treatment independent of all patient characteristics, measured and unmeasured, which eliminates confounding in expectation.
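A small simulation can make this concrete. The sketch below assumes a made-up data-generating process (a hidden `frailty` trait, Gaussian noise, and an average effect of 3.0 are all illustrative choices, not values from the text) and checks that a coin-flip assignment recovers the ATE from observed outcomes alone.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

N = 20_000
population = []
for _ in range(N):
    frailty = random.gauss(0.0, 1.0)  # unmeasured patient characteristic
    y0 = 10.0 + 2.0 * frailty + random.gauss(0.0, 1.0)  # Y(0)
    y1 = y0 + 3.0 + frailty  # Y(1); the effect varies across subjects
    population.append((y0, y1))

# Known only because this is a simulation: the mean of Y(1) - Y(0).
true_ate = mean(y1 - y0 for y0, y1 in population)

# Coin-flip assignment, independent of frailty and of both potential outcomes.
treated_outcomes, control_outcomes = [], []
for y0, y1 in population:
    if random.random() < 0.5:
        treated_outcomes.append(y1)  # only Y(1) is observed
    else:
        control_outcomes.append(y0)  # only Y(0) is observed

estimated_ate = mean(treated_outcomes) - mean(control_outcomes)
print(f"true ATE = {true_ate:.2f}, difference in means = {estimated_ate:.2f}")
```

The analyst never sees `frailty`, yet the two numbers agree up to sampling noise, because the coin flip balances it across arms on average.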

In observational studies, treatment is not randomly assigned: patients who receive a treatment may differ systematically from those who do not. Confounders (variables that cause both treatment and outcome) create spurious associations. Directed acyclic graphs (DAGs) provide a rigorous visual language for representing causal relationships and identifying what must be controlled for. The backdoor criterion states that the causal effect of X on Y is identified if you condition on a set of variables that contains no descendants of X and blocks all backdoor paths (non-causal paths between X and Y that begin with an arrow into X, typically running through confounders). DAGs also reveal what you should not condition on: colliders (variables caused by both treatment and outcome), which open a non-causal path and introduce bias when conditioned upon, and mediators (variables on the causal path from treatment to outcome), which block part of the very effect you are trying to estimate.
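Backdoor adjustment can be illustrated with a simulation of the simplest confounded DAG, Z -> T, Z -> Y, T -> Y. Everything in this sketch is an assumption made for illustration: a single binary confounder `z`, the treatment probabilities, and a true effect of 1.0. The adjustment shown is plain stratification on Z, one of several ways to condition on a backdoor set.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is reproducible

TRUE_EFFECT = 1.0  # assumed effect of T on Y in this simulation
N = 20_000
rows = []  # each row is (z, t, y)
for _ in range(N):
    z = 1 if random.random() < 0.5 else 0  # confounder, e.g. severe disease
    p_treat = 0.8 if z == 1 else 0.2       # Z -> T: sicker patients get treated more
    t = 1 if random.random() < p_treat else 0
    y = TRUE_EFFECT * t + 4.0 * z + random.gauss(0.0, 1.0)  # T -> Y and Z -> Y
    rows.append((z, t, y))


def mean_outcome(t_value, z_value=None):
    """Mean of Y among rows with T == t_value, optionally within one stratum of Z."""
    return mean(
        y for z, t, y in rows
        if t == t_value and (z_value is None or z == z_value)
    )


# Unadjusted contrast: the backdoor path T <- Z -> Y is left open.
naive = mean_outcome(1) - mean_outcome(0)

# Backdoor adjustment: contrast within each stratum of Z, then average the
# stratum-specific contrasts using the population distribution of Z.
adjusted = 0.0
for z_value in (0, 1):
    weight = sum(1 for z, _, _ in rows if z == z_value) / N
    adjusted += weight * (mean_outcome(1, z_value) - mean_outcome(0, z_value))

print(f"naive = {naive:.2f}, adjusted = {adjusted:.2f}, truth = {TRUE_EFFECT:.2f}")
```

The naive contrast is far from the true effect because treated patients are mostly the severe ones. Stratifying on Z blocks the only backdoor path in this DAG, so the adjusted estimate lands near the truth. The same recipe would fail if an unmeasured confounder were added to the simulation.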

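The various causal inference methods (propensity scores, instrumental variables, difference-in-differences, regression discontinuity) each address confounding under different assumptions about which variables are observed and how treatment assignment works. No method eliminates the need for assumptions; each makes a different set of largely untestable assumptions explicit. Propensity scores assume no unmeasured confounders and that every subject has a nonzero chance of receiving either treatment (positivity). Instrumental variables assume the existence of a variable that affects treatment, affects the outcome only through treatment, and shares no unmeasured causes with the outcome. Difference-in-differences assumes parallel trends: absent treatment, the treated and comparison groups would have changed by the same amount over time. Regression discontinuity assumes that subjects just above and just below the assignment cutoff are comparable, so potential outcomes vary smoothly across it. The choice of method depends on the data structure and the plausibility of its specific assumptions.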
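Two of these estimators reduce to simple arithmetic on group means, which the sketch below works through. The function names and all input numbers are hypothetical, chosen only to keep the arithmetic easy to follow. The ratio form used for the instrument is the standard Wald estimator for a binary instrument; under the instrument assumptions plus further conditions such as monotonicity, it is usually interpreted as an effect among subjects whose treatment responds to the instrument.

```python
def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences from four group means.

    Under parallel trends, the control group's change stands in for the change
    the treated group would have seen without treatment.
    """
    return (treated_post - treated_pre) - (control_post - control_pre)


def wald_iv_estimate(y_mean_z1, y_mean_z0, t_rate_z1, t_rate_z0):
    """Wald estimator for a binary instrument Z.

    Numerator: effect of the instrument on the outcome.
    Denominator: effect of the instrument on treatment uptake (first stage).
    """
    first_stage = t_rate_z1 - t_rate_z0
    if first_stage == 0:
        raise ValueError("instrument does not change treatment uptake")
    return (y_mean_z1 - y_mean_z0) / first_stage


# Hypothetical group means, for illustration only.
did = did_estimate(treated_pre=10.0, treated_post=15.0,
                   control_pre=8.0, control_post=11.0)
iv = wald_iv_estimate(y_mean_z1=12.0, y_mean_z0=11.0,
                      t_rate_z1=0.75, t_rate_z0=0.25)

print("difference-in-differences estimate:", did)  # (15 - 10) - (11 - 8)
print("Wald IV estimate:", iv)                     # (12 - 11) / (0.75 - 0.25)
```

Neither function checks its identifying assumption. Both return a number for any inputs, and whether that number is a causal effect depends entirely on whether parallel trends or the instrument conditions are plausible for the data at hand.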
