Missing data is ubiquitous in health research — patients miss clinic visits, questionnaire items are left blank, lab values are not drawn. Simple approaches (complete-case analysis, single-value imputation) either waste data or underestimate uncertainty. Multiple imputation (MI) addresses both problems by creating m (typically 20-50) complete datasets, each with missing values replaced by plausible values drawn from the predictive distribution of the missing data given the observed data. Each dataset is analyzed separately with standard methods, and the results are combined using Rubin's rules, which properly account for both within-imputation uncertainty (sampling variability) and between-imputation uncertainty (uncertainty about the missing values). MI is valid under the Missing at Random (MAR) assumption — that missingness depends on observed data but not on the missing values themselves — which is weaker than the Missing Completely at Random (MCAR) assumption required by complete-case analysis.
Missing data is not just an inconvenience — it is a structural problem that can invalidate study conclusions if handled incorrectly. The three missing data mechanisms defined by Rubin (1976) determine what is at stake. MCAR (Missing Completely at Random) means missingness is unrelated to any data, observed or unobserved — like a lab machine randomly failing. MAR (Missing at Random) means missingness depends on observed data but not on the missing values — sicker patients (identified by observed severity scores) are more likely to drop out, but among patients with the same severity, missingness is random. MNAR (Missing Not at Random) means missingness depends on the missing values themselves — patients with the worst outcomes are the ones who stop coming back.
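The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the function name simulate, the logistic missingness probabilities, and the data-generating model (an outcome equal to an always-observed severity score plus noise) are assumptions chosen for the example, not part of Rubin's definitions. It compares the true mean of the outcome with the mean of the values that remain observed.

```python
import math
import random
import statistics


def simulate(mechanism, n=20000, seed=0):
    """Return (mean of all outcomes, mean of observed outcomes) under one mechanism.

    Hypothetical setup: severity is always observed; the outcome y may go missing.
    """
    rng = random.Random(seed)
    all_y, observed_y = [], []
    for _ in range(n):
        severity = rng.gauss(0, 1)
        y = severity + rng.gauss(0, 1)
        if mechanism == "MCAR":
            p_missing = 0.3                              # unrelated to any data
        elif mechanism == "MAR":
            p_missing = 1 / (1 + math.exp(-severity))    # depends on observed severity only
        elif mechanism == "MNAR":
            p_missing = 1 / (1 + math.exp(-y))           # depends on the value that goes missing
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        all_y.append(y)
        if rng.random() >= p_missing:
            observed_y.append(y)
    return statistics.fmean(all_y), statistics.fmean(observed_y)


results = {name: simulate(name) for name in ("MCAR", "MAR", "MNAR")}
for name, (true_mean, observed_mean) in results.items():
    print(f"{name}: true mean {true_mean:.3f}, observed-only mean {observed_mean:.3f}")
```

In this setup the observed-only mean stays close to the true mean under MCAR and is pulled downward under MAR and MNAR, because subjects with higher values are more likely to be missing. Under MAR the shift can in principle be corrected using severity; under MNAR the observed data alone cannot reveal it.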
Complete-case analysis, which analyzes only subjects with no missing data, is unbiased in general only under MCAR, and even then it discards information and loses precision. Mean imputation, last-observation-carried-forward, and other single-imputation methods introduce bias, underestimate uncertainty, or both, because they treat imputed values as if they had actually been observed. Multiple imputation was developed to handle MAR data while properly quantifying the additional uncertainty caused by missingness.
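A short numeric illustration of why single imputation underestimates uncertainty: filling every gap with the observed mean leaves the point estimate unchanged but shrinks the spread and the naive standard error. The data below are simulated, and the variable names and the 40% MCAR deletion rate are assumptions made for the example.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical lab values; about 40% are deleted completely at random.
full = [rng.gauss(50, 10) for _ in range(1000)]
observed = [v for v in full if rng.random() > 0.4]
n_missing = len(full) - len(observed)

# Single (mean) imputation: every missing value is replaced by the observed mean.
mean_observed = statistics.fmean(observed)
mean_imputed = observed + [mean_observed] * n_missing

sd_observed = statistics.stdev(observed)
sd_imputed = statistics.stdev(mean_imputed)

# Naive standard errors of the mean, treating imputed values as real data.
se_observed = sd_observed / len(observed) ** 0.5
se_imputed = sd_imputed / len(mean_imputed) ** 0.5

print(f"SD: observed {sd_observed:.2f}, after mean imputation {sd_imputed:.2f}")
print(f"SE: observed {se_observed:.3f}, after mean imputation {se_imputed:.3f}")
```

The imputed dataset reports a smaller standard deviation and a smaller standard error than the observed data support, even though no new information was added.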
The MI procedure has three steps. Imputation: a statistical model predicts missing values based on observed data, drawing from the predictive distribution to create m complete datasets. Each dataset has different imputed values, reflecting uncertainty about the true values. Analysis: each complete dataset is analyzed with the standard method (regression, survival analysis, etc.), producing m sets of estimates and standard errors. Pooling: Rubin's rules combine the m results. The pooled point estimate is the average of the m estimates. The total variance includes both the within-imputation variance (average of the m variance estimates) and the between-imputation variance (variance of the m point estimates, scaled by (1 + 1/m)). The between-imputation component captures exactly the uncertainty due to not knowing the missing values.
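The pooling step can be sketched in a few lines. The function below is a minimal illustration of Rubin's rules as described above; the name pool_rubin and the example numbers are made up for the sketch, and it returns only the pooled estimate, total variance, and standard error (it does not compute degrees of freedom or confidence intervals).

```python
import statistics


def pool_rubin(estimates, variances):
    """Pool m point estimates and their variances (squared standard errors).

    Returns (pooled estimate, total variance, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two estimates and one variance per estimate")
    q_bar = statistics.fmean(estimates)        # pooled point estimate
    within = statistics.fmean(variances)       # within-imputation variance
    between = statistics.variance(estimates)   # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between     # total variance
    return q_bar, total, total ** 0.5


# Hypothetical regression coefficients from m = 5 imputed datasets,
# each with squared standard error 0.04.
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
variances = [0.04] * 5
pooled_estimate, total_variance, pooled_se = pool_rubin(estimates, variances)
# Here within = 0.04 and between = 0.025, so total = 0.04 + 1.2 * 0.025 = 0.07.
print(pooled_estimate, total_variance, pooled_se)
```

In this example the between-imputation term contributes 0.03 of the total variance of 0.07, which is the extra uncertainty that analyzing a single imputed dataset would have ignored.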
The imputation model is critical and must be at least as rich as the analysis model: it should include every variable in the analysis model (including the outcome) plus auxiliary variables correlated with the incomplete variables or with the missingness mechanism. An imputation model that omits important predictors can produce biased imputations; leaving out the outcome, for example, tends to attenuate estimated associations toward zero. Modern implementations (mice in R, mi impute chained in Stata) use chained equations, also called MICE or fully conditional specification (FCS), which iterate through conditionally specified models for each variable with missing data and so accommodate mixtures of continuous, binary, and categorical variables. The practical guidance is: include everything plausibly related to the missing data or the missingness mechanism, use at least 20 imputations, and perform sensitivity analyses for MNAR.
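To show the shape of the chained-equations loop, here is a deliberately simplified sketch for two continuous variables, written with the Python standard library only. It is not the algorithm used by mice or Stata: the function names (fit_line, impute_once, multiply_impute) are hypothetical, each conditional model is a simple linear regression, and a bootstrap resample of the observed rows stands in for a proper posterior draw of the regression parameters.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares intercept, slope, and residual SD for ys regressed on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (rss / (len(xs) - 2)) ** 0.5


def impute_once(rows, rng, n_iter=10):
    """Return one completed copy of rows ([x, y] pairs, None marks a missing value)."""
    missing = [[r[j] is None for r in rows] for j in range(2)]
    data = [list(r) for r in rows]
    # Starting values: fill each gap with the observed mean of its column.
    for j in range(2):
        col_mean = statistics.fmean([r[j] for r in rows if r[j] is not None])
        for i, r in enumerate(data):
            if missing[j][i]:
                r[j] = col_mean
    for _ in range(n_iter):
        for j in range(2):                    # visit each incomplete variable in turn
            k = 1 - j                         # the other variable is the predictor
            donors = [i for i in range(len(data)) if not missing[j][i]]
            boot = [rng.choice(donors) for _ in donors]   # reflects parameter uncertainty
            a, b, sd = fit_line([data[i][k] for i in boot],
                                [data[i][j] for i in boot])
            for i in range(len(data)):
                if missing[j][i]:
                    # Draw from the predictive distribution, not just the fitted mean.
                    data[i][j] = a + b * data[i][k] + rng.gauss(0, sd)
    return data


def multiply_impute(rows, m=20, seed=0):
    """Create m completed datasets, each with different imputed values."""
    rng = random.Random(seed)
    return [impute_once(rows, rng) for _ in range(m)]


# Simulated demo data: y depends on x; about 10% of x and 25% of y are missing.
demo_rng = random.Random(42)
rows = []
for _ in range(200):
    x = demo_rng.gauss(0, 1)
    y = 2 * x + demo_rng.gauss(0, 1)
    rows.append([x if demo_rng.random() > 0.1 else None,
                 y if demo_rng.random() > 0.25 else None])

completed = multiply_impute(rows, m=5, seed=1)
```

Each dataset in completed keeps the observed values untouched and differs from the others only in the imputed cells. Analyzing each one and pooling the results with Rubin's rules completes the procedure. Real implementations additionally choose a model type per variable (for example logistic regression for binary variables) and offer convergence diagnostics, which this sketch omits.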