Missing Data: Mechanisms, Diagnostics, and Multiple Imputation

Graduate · Depth 74 in the knowledge graph
Unlocks 2 downstream topics
missing-data imputation mcar-mar multiple-imputation

Core Idea

Missing data is ubiquitous in social research. Data can be missing completely at random (MCAR), at random given observed data (MAR), or not at random (MNAR). Each mechanism requires different handling. When data is MAR, multiple imputation propagates the uncertainty caused by missingness into the final estimates and yields valid inference, provided the imputation model is reasonably well specified.

Explainer

Your prerequisite on regression diagnostics introduced the idea that real data often violates the clean assumptions of standard models. Missing data is one of the most common and consequential violations: when observations are incomplete, naive analysis can produce severely biased results. The key insight is that *how* data goes missing matters as much as *how much* is missing. The three mechanisms form a hierarchy of severity, and each implies a different treatment strategy.

Missing Completely at Random (MCAR) is the most benign: whether a value is missing is unrelated to any variable in the dataset, observed or unobserved. A random lab malfunction that destroys 5% of samples is MCAR. Listwise deletion (dropping every row with any missing value) produces unbiased estimates under MCAR, though it shrinks the sample and so reduces statistical power.

Missing at Random (MAR) is more realistic and more insidious: missingness depends on observed variables but, once those are accounted for, not on the missing values themselves. Older survey respondents might be less likely to report income; if non-reporting is unrelated to income among respondents of the same age, the data is MAR given age. Here, listwise deletion generally produces biased estimates because it drops a systematically non-random subset of observations. Your conditional probability prerequisite explains why: the retained cases are not a random draw from the population, so the remaining sample is distorted (in this example, toward younger respondents).

Missing Not at Random (MNAR) is the worst case: missingness depends on the unobserved value itself, even after conditioning on everything observed. High earners skip the income question *because* they earn a lot, so the missingness carries information about the very thing you're trying to measure. No standard imputation method can fully correct for MNAR without an explicit, and untestable, model of the missingness process.
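The contrast between the three mechanisms can be seen in a small simulation. The sketch below is illustrative only: the population model (income rising linearly with age, true mean 50), the logistic missingness probabilities, and every variable name are assumptions made for this example, not part of any real survey. For each mechanism it compares the listwise-deletion mean of income with a mean that conditions on age, the observed variable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical population: income (in thousands) rises with age; true mean is 50.
age = rng.normal(50, 10, n)
income = 20 + 0.6 * age + rng.normal(0, 5, n)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed probability that income is missing under each mechanism.
p_missing = {
    "MCAR": np.full(n, 0.5),             # unrelated to any variable
    "MAR": sigmoid((age - 50) / 5),      # depends only on observed age
    "MNAR": sigmoid((income - 50) / 5),  # depends on income itself
}

results = {}
for name, p in p_missing.items():
    observed = rng.random(n) >= p
    # Listwise deletion: average income over complete cases only.
    complete_case = income[observed].mean()
    # Condition on age: fit income ~ age on complete cases, then predict for everyone.
    slope, intercept = np.polyfit(age[observed], income[observed], 1)
    age_adjusted = (intercept + slope * age).mean()
    results[name] = (complete_case, age_adjusted)
    print(f"{name}: complete-case mean {complete_case:.2f}, "
          f"age-adjusted mean {age_adjusted:.2f}")
```

Under these assumptions, the complete-case mean should sit near 50 only for MCAR. Conditioning on age should recover the mean under MAR, while under MNAR both estimates stay too low, because the selection acts on income itself.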

Multiple imputation is the principled solution for MAR data. Single imputation substitutes one "best guess" for each missing value and then treats it as if it had been observed, which understates uncertainty. Multiple imputation instead generates several complete datasets, each with plausible imputed values drawn from a probability model that conditions on all observed data. This is where your probability foundations are essential: each imputed value is a draw from the conditional distribution of the missing variable given everything observed. The analysis model is run on each imputed dataset separately, and the results are combined using Rubin's rules: the pooled point estimate is the average of the per-dataset estimates, and its variance adds the average within-imputation variance to the between-imputation variance, so the standard error reflects the uncertainty introduced by missingness itself. The final confidence intervals are appropriately wider than they would be with complete data, which is honest, because information was genuinely lost.
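The pooling step can be written in a few lines. The sketch below assumes the analysis has already been run on each of the m imputed datasets and that the per-dataset point estimates and squared standard errors for a single parameter are at hand. The function name `pool_rubin` and the example numbers are hypothetical. The degrees-of-freedom formula is the classic large-sample version from Rubin (1987); small-sample corrections exist but are not shown.

```python
import math
import numpy as np

def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets using Rubin's rules."""
    q = np.asarray(estimates, dtype=float)  # point estimate from each dataset
    u = np.asarray(variances, dtype=float)  # squared standard error from each dataset
    m = len(q)
    if m < 2:
        raise ValueError("need at least two imputed datasets")
    q_bar = q.mean()          # pooled point estimate
    within = u.mean()         # average within-imputation variance
    between = q.var(ddof=1)   # between-imputation variance
    total = within + (1 + 1 / m) * between
    # Degrees of freedom for the t reference distribution (Rubin, 1987).
    if between > 0:
        df = (m - 1) * (1 + within / ((1 + 1 / m) * between)) ** 2
    else:
        df = math.inf
    return q_bar, math.sqrt(total), df

# Hypothetical coefficients and standard errors from m = 3 imputed datasets.
coefs = [1.0, 1.2, 0.8]
ses = [0.2, 0.2, 0.2]
pooled, pooled_se, df = pool_rubin(coefs, [s ** 2 for s in ses])
print(f"estimate {pooled:.3f}, SE {pooled_se:.3f}, df {df:.2f}")
```

In this example the pooled standard error comes out larger than the 0.2 reported within each dataset, because the estimates disagree across imputations and that disagreement enters as between-imputation variance.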

Diagnosing the missing data mechanism is crucial before choosing a method, but the mechanism is only partly testable. You can detect departures from MCAR by comparing cases with and without missing values on the observed variables: if the two groups differ systematically, MCAR is violated. Finding no difference does not prove MCAR, since missingness could still depend on something unobserved. MAR cannot be distinguished from MNAR using the observed data alone, because the information needed to tell them apart is by definition missing. Subject-matter knowledge about why data might be missing (survey design, participant attrition, measurement error patterns) is the primary resource here. Sensitivity analyses that model different MNAR scenarios and check how much the conclusions change are the best available defense against overconfident inference when the missing data mechanism is uncertain.
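The comparison of cases with and without missing values can be sketched as follows. Everything here is an assumption for illustration: the simulated survey, the variable names, and the choice of Welch's t statistic (computed by hand with NumPy) as the comparison. A real analysis would repeat this for every observed variable, or regress a missingness indicator on all of them.

```python
import numpy as np

def welch_t(a, b):
    """Welch's t statistic for a difference in means with unequal variances."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se

rng = np.random.default_rng(1)
n = 5_000

# Hypothetical survey: age is fully observed; income is missing more often
# for older respondents (a MAR mechanism by construction).
age = rng.normal(50, 10, n)
income = 20 + 0.6 * age + rng.normal(0, 5, n)
p_missing = 1.0 / (1.0 + np.exp(-(age - 50) / 5))
income[rng.random(n) < p_missing] = np.nan

# Compare the observed variable (age) between rows with and without income.
is_missing = np.isnan(income)
t_stat = welch_t(age[is_missing], age[~is_missing])
print(f"mean age, income missing:  {age[is_missing].mean():.1f}")
print(f"mean age, income observed: {age[~is_missing].mean():.1f}")
print(f"Welch t statistic: {t_stat:.1f}")
```

A large absolute t statistic is evidence against MCAR. A small one would not establish MCAR, and neither outcome says anything about whether the data is MAR or MNAR.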
