A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Instrumental Variables in Biostatistics

Core Idea

Instrumental variables (IV) in biostatistics provide causal estimates when unmeasured confounding is present, which is the situation where propensity score methods fail. An instrument Z must satisfy three conditions: (1) relevance: Z is associated with the treatment X; (2) independence: Z is independent of the unmeasured confounders of X and the outcome Y; and (3) the exclusion restriction: Z affects Y only through X. In biostatistics, the most prominent application is Mendelian randomization, which uses genetic variants as instruments: genetic variants are randomly allocated at conception (natural randomization), are generally not confounded by lifestyle or socioeconomic factors, and are assumed to affect outcomes only through the exposure they influence. IV estimates a Local Average Treatment Effect (LATE): the causal effect for "compliers" whose treatment is shifted by the instrument, not for the entire population.

Explainer

Propensity score methods assume that all confounders are measured — a strong assumption that is often implausible. If physician prescribing decisions are based partly on clinical judgment that is not captured in the data, propensity scores cannot eliminate this confounding. Instrumental variables offer an alternative approach that can produce causal estimates even with unmeasured confounders, provided a valid instrument exists.

The logic of IV is intuitive: find a source of variation in treatment that is "as good as random", meaning it is independent of the confounders. If the instrument shifts treatment quasi-randomly, then any difference in outcomes between people with different instrument values can only have arisen through the treatment. Dividing that outcome difference by the corresponding difference in treatment gives the causal effect per unit of treatment (the Wald ratio). The instrument acts as a natural experiment embedded within the observational data. The classic biostatistical example is Mendelian randomization (MR), which exploits the random assortment of genetic variants during meiosis. A genetic variant that affects alcohol metabolism creates natural variation in alcohol consumption that is independent of the socioeconomic and behavioral factors that confound observational studies.
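The ratio logic can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are synthetic, the effect sizes (0.4 for the variant, 0.5 for the true causal effect) are arbitrary illustration values, and the helper names `wald_estimate` and `ols_slope` are hypothetical rather than part of any library.

```python
import numpy as np

def ols_slope(x, y):
    """Naive regression slope of y on x (ignores confounding)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

def wald_estimate(z, x, y):
    """IV (Wald ratio) estimate: cov(Z, Y) / cov(Z, X)."""
    z, x, y = (np.asarray(a, dtype=float) for a in (z, x, y))
    return np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

# Synthetic data; every number below is an assumption for illustration.
rng = np.random.default_rng(0)
n = 200_000
true_effect = 0.5

u = rng.normal(size=n)                 # unmeasured confounder
z = rng.binomial(2, 0.3, size=n)       # allele count, generated independently of u
x = 0.4 * z + u + rng.normal(size=n)   # exposure depends on variant and confounder
y = true_effect * x + u + rng.normal(size=n)  # outcome; z enters only through x

naive = ols_slope(x, y)        # biased upward because u drives both x and y
iv = wald_estimate(z, x, y)    # should land close to true_effect

print(f"naive OLS slope: {naive:.3f}")
print(f"IV Wald ratio:   {iv:.3f}")
```

In this setup the naive slope absorbs the confounder's influence, while the Wald ratio uses only the variation in the exposure that traces back to the simulated variant.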

The three IV assumptions must all hold. Relevance (the instrument predicts treatment) is testable: regress treatment on the instrument and check the first-stage F-statistic, since a small value signals a weak instrument. Independence (the instrument shares no unmeasured causes with the outcome) is supported by the biology of Mendelian inheritance but can be violated by population stratification or dynastic effects. The exclusion restriction (the instrument affects the outcome only through the treatment) cannot be tested directly and is the most controversial of the three. In MR, it is violated by pleiotropy, which occurs when the genetic variant affects the outcome through biological pathways other than the exposure of interest.

The IV estimate has a specific causal interpretation: the Local Average Treatment Effect (LATE). It applies to "compliers", the subpopulation whose treatment would change if the instrument changed. This interpretation also relies on monotonicity: the instrument never pushes anyone's treatment in the opposite direction. In MR, compliers are people whose alcohol consumption is actually modified by the genetic variant. The LATE may differ from the average treatment effect (ATE) if treatment effects are heterogeneous. A genetic variant that slightly reduces moderate drinking yields a LATE for moderate drinkers, which may not match the effect of moving from heavy drinking to abstinence. Understanding what population your IV estimate describes is as important as getting the mechanics right.
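The gap between the LATE and the ATE can be made concrete with a toy simulation using a binary instrument and a binary treatment. Everything here is an assumption chosen for illustration: the three compliance groups and their shares, the per-group treatment effects (1.0 for compliers, 3.0 for always-takers, 0.0 for never-takers), and the absence of "defiers" (monotonicity).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000

# Hypothetical compliance groups; no defiers, so monotonicity holds.
strata = rng.choice(["complier", "always", "never"], size=n, p=[0.4, 0.3, 0.3])
z = rng.integers(0, 2, size=n)  # binary instrument, assigned independently

# Treatment actually received: only compliers follow the instrument.
x = np.where(strata == "always", 1, np.where(strata == "never", 0, z))

# Heterogeneous individual treatment effects (illustrative values).
effect = np.select(
    [strata == "complier", strata == "always"],
    [1.0, 3.0],
    default=0.0,
)
y = effect * x + rng.normal(size=n)  # observed outcome

# Wald estimator: outcome difference by instrument / treatment difference.
late_hat = (y[z == 1].mean() - y[z == 0].mean()) / (
    x[z == 1].mean() - x[z == 0].mean()
)
ate = effect.mean()                                  # population-average effect
complier_effect = effect[strata == "complier"].mean()

print(f"IV (Wald) estimate:   {late_hat:.3f}")
print(f"complier effect:      {complier_effect:.3f}")
print(f"population ATE:       {ate:.3f}")
```

In this setup the IV estimate tracks the complier effect rather than the population average, because always-takers and never-takers contribute nothing to the instrument-induced change in treatment.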
