The propensity score is the probability of receiving treatment given observed covariates: e(X) = P(Treatment = 1 | X). Rosenbaum and Rubin (1983) proved that conditioning on the propensity score balances the observed covariates X between treatment groups, reducing a high-dimensional confounding adjustment problem to a single dimension. Propensity scores can be used via matching (pairing treated and control subjects with similar scores), stratification (grouping subjects into propensity score strata), inverse probability of treatment weighting (IPTW, weighting each subject by the inverse of the probability of receiving the treatment they actually received), or covariate adjustment (including the score as a regressor in the outcome model). All approaches assume strongly ignorable treatment assignment: after conditioning on observed covariates, treatment assignment is independent of potential outcomes (no unmeasured confounding), and every subject has a probability of treatment strictly between 0 and 1 (positivity). The no-unmeasured-confounding assumption cannot be tested from the data and is the primary limitation of all propensity score methods.
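The definitions above can be written compactly. This is a notational sketch only: the symbols T (treatment indicator) and Y(0), Y(1) (potential outcomes under control and treatment) are assumed here for illustration and do not appear in the text above.

```latex
\begin{align}
e(X) &= P(T = 1 \mid X) && \text{(propensity score)} \\
X &\perp T \mid e(X) && \text{(balancing property)} \\
\bigl(Y(0), Y(1)\bigr) &\perp T \mid X, \quad 0 < e(X) < 1 && \text{(strong ignorability)} \\
\bigl(Y(0), Y(1)\bigr) &\perp T \mid e(X) && \text{(consequence of the two lines above)}
\end{align}
```

The last line is what justifies adjusting for the scalar e(X) instead of the full covariate vector X.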
Randomized trials balance both measured and unmeasured confounders in expectation, by design, but many important clinical questions cannot be studied with randomization (for example, it would be unethical to randomize people to smoke). Observational data are abundant but confounded: patients who receive treatment differ systematically from those who do not. If patients prescribed statins are older, sicker, and have higher cholesterol, a naive comparison of outcomes between statin users and non-users conflates the treatment effect with the effects of age, severity, and cholesterol.
The propensity score collapses all measured confounders into a single number: the probability of receiving treatment, which in practice must be estimated from the data. Two patients with the same propensity score may differ on individual covariates but are equally likely to have been treated, given their observed characteristics. Comparing outcomes between treated and untreated subjects with similar propensity scores is analogous to comparing arms within a stratum of a randomized trial in which every subject had the same probability of treatment (that probability is whatever the shared score is; it need not be 0.5). The key theorem (Rosenbaum and Rubin, 1983) proves that conditioning on the true propensity score is sufficient to balance all the observed covariates that define it. Because an estimated score can be misspecified, covariate balance should be checked after adjustment rather than assumed.
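Estimating the score is usually done with a model such as logistic regression. The sketch below is a minimal, standard-library-only illustration under strong simplifying assumptions: a single simulated covariate (called `age` here purely for illustration), a hypothetical true model with intercept -0.5 and slope 1.0, and a hand-written Newton-Raphson fit. A real analysis would include many covariates and would normally use an established statistics library.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_propensity(x, t, n_iter=25):
    """Fit P(T=1 | x) = sigmoid(b0 + b1 * x) by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, ti in zip(x, t):
            p = sigmoid(b0 + b1 * xi)
            r = ti - p            # residual drives the gradient
            w = p * (1.0 - p)     # curvature weight
            g0 += r
            g1 += r * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * step = g in closed form.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Simulated data (hypothetical): treatment is more likely at higher `age`.
random.seed(0)
n = 2000
age = [random.gauss(0.0, 1.0) for _ in range(n)]  # standardized covariate
treated = [1 if random.random() < sigmoid(-0.5 + 1.0 * a) else 0 for a in age]

b0, b1 = fit_propensity(age, treated)
scores = [sigmoid(b0 + b1 * a) for a in age]  # estimated propensity scores
print(f"intercept={b0:.2f}, slope={b1:.2f}")
```

Each entry of `scores` is one subject's estimated probability of treatment. With several covariates the same idea applies, and the resulting score still reduces them to one number per subject.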
The four implementation strategies have different practical tradeoffs. Matching pairs treated and untreated subjects with similar propensity scores, creating a balanced sample but potentially excluding subjects without good matches (reducing sample size and generalizability). Stratification divides the sample into propensity score quantiles (often quintiles) and estimates the treatment effect within each stratum before combining the estimates; it is simple, but coarse strata can leave residual confounding within each stratum. IPTW weights each subject by the inverse of their probability of receiving the treatment they actually received, creating a pseudo-population in which treatment is independent of observed confounders. It uses all subjects but can be unstable when propensity scores are close to 0 or 1, because a few subjects then receive very large weights. Covariate adjustment includes the propensity score as a covariate in a regression model for the outcome, which is the simplest approach but relies on correctly specifying how the outcome depends on the score.
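To make the weighting and stratification strategies concrete, here is a minimal sketch on a hypothetical toy dataset with one binary confounder `x`. The data are constructed so that the true treatment effect is +2 in both strata. The propensity score is taken as the treated fraction at each level of `x`, which is exact only because there is a single binary covariate, and the IPTW estimate uses normalized weights, which is one common choice rather than the only one.

```python
from collections import defaultdict

# Hypothetical toy data: (confounder x, treatment t, outcome y).
# Subjects with x = 1 are treated more often and have higher outcomes.
data = (
    [(0, 1, 12.0)] * 20 + [(0, 0, 10.0)] * 80
    + [(1, 1, 22.0)] * 80 + [(1, 0, 20.0)] * 20
)


def mean(values):
    return sum(values) / len(values)


# Naive comparison: ignores the confounder entirely.
naive = (mean([y for _, t, y in data if t == 1])
         - mean([y for _, t, y in data if t == 0]))

# Propensity score per level of x: fraction of subjects treated.
counts = defaultdict(lambda: [0, 0])  # x -> [n_treated, n_total]
for x, t, _ in data:
    counts[x][0] += t
    counts[x][1] += 1
e = {x: n_treated / n_total for x, (n_treated, n_total) in counts.items()}


def iptw_effect(rows, score):
    """Difference in weighted means, weights = 1 / P(received treatment)."""
    num = {0: 0.0, 1: 0.0}
    den = {0: 0.0, 1: 0.0}
    for x, t, y in rows:
        p_received = score[x] if t == 1 else 1.0 - score[x]
        w = 1.0 / p_received
        num[t] += w * y
        den[t] += w
    return num[1] / den[1] - num[0] / den[0]


def stratified_effect(rows, score):
    """Within-stratum differences, averaged by stratum size."""
    strata = defaultdict(lambda: {0: [], 1: []})
    for x, t, y in rows:
        strata[score[x]][t].append(y)  # stratum = subjects sharing a score
    effect = 0.0
    for groups in strata.values():
        n_stratum = len(groups[0]) + len(groups[1])
        diff = mean(groups[1]) - mean(groups[0])
        effect += (n_stratum / len(rows)) * diff
    return effect


iptw = iptw_effect(data, e)
stratified = stratified_effect(data, e)
print(naive, iptw, stratified)
```

On this toy dataset the naive difference is 8.0, while both adjusted estimates recover the constructed effect of 2.0. The weights here are modest (1.25 and 5); with scores near 0 or 1 the same formula would give a handful of subjects very large weights, which is the instability described above.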
The critical limitation is that propensity scores address only measured confounders. If an important confounder is not included in the propensity model — because it was not measured or not recognized as a confounder — the treatment effect estimate remains biased. This is why sensitivity analyses (e.g., Rosenbaum bounds, E-values) are essential: they quantify how strong an unmeasured confounder would need to be to explain away the observed effect. A large, robust effect that survives sensitivity analysis is more credible than a small effect that could be explained by even modest unmeasured confounding.
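As an illustration of one such sensitivity measure, the sketch below computes an E-value for a point estimate expressed as a risk ratio, using the formula usually attributed to VanderWeele and Ding: E = RR + sqrt(RR × (RR − 1)), applied to 1/RR when RR is below 1. The function name `e_value` is chosen here for illustration, and the sketch covers only the point estimate, not the confidence limit.

```python
import math


def e_value(rr):
    """E-value for an observed risk ratio `rr` (point estimate only)."""
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    if rr < 1.0:
        rr = 1.0 / rr  # protective effects: work with the inverse ratio
    return rr + math.sqrt(rr * (rr - 1.0))


print(round(e_value(2.0), 2))
```

For an observed risk ratio of 2.0 the E-value is about 3.41: an unmeasured confounder would need to be associated with both treatment and outcome by a risk ratio of roughly 3.41 each, beyond the measured covariates, to fully explain away the observed effect. A risk ratio of 1.0 gives an E-value of 1.0, meaning no unmeasured confounding is needed to explain a null result.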