Questions: Data Preparation, Screening, and Quality Assurance
5 questions to test your understanding
Question 1 Multiple Choice
In a depression study, participants with the highest depression scores are significantly more likely to skip the follow-up questionnaire. What type of missingness is this, and what is its primary implication?
A. MCAR: missingness is unrelated to anything, so listwise deletion produces unbiased estimates
B. MAR: missingness depends on observed variables, so multiple imputation using other variables is valid
C. MNAR: missingness is related to the unobserved values themselves, meaning analyses that ignore it will likely be biased
D. MCAR: because we cannot directly observe why participants skipped the questionnaire
Answer: C
When the probability of missingness is related to the missing value itself (severely depressed participants skip depression questions because they are severely depressed), the data are Missing Not at Random (MNAR). This is the most serious mechanism because no standard statistical technique can fully correct for it using observed data alone. Listwise deletion, mean imputation, and even standard multiple imputation all produce biased estimates under MNAR. Handling it requires sensitivity analyses, which test how conclusions change under different assumptions about the missing values, and transparent acknowledgment of the limitation.
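The sensitivity analysis mentioned above can be sketched in a few lines. This is a minimal illustration, not a full pattern-mixture model: the scores, the number of non-responders, and the `delta` values are all hypothetical, and `mean_under_shift` is a helper name invented for this example. It asks how the estimated mean would change if non-responders scored `delta` points higher, on average, than responders.

```python
import statistics

# Hypothetical follow-up depression scores from participants who responded.
observed_scores = [12, 15, 9, 20, 14, 11, 17, 13]
n_missing = 4  # hypothetical number of non-responders

observed_mean = statistics.fmean(observed_scores)
n_total = len(observed_scores) + n_missing


def mean_under_shift(delta):
    """Overall mean if non-responders average `delta` points above responders."""
    assumed_missing_mean = observed_mean + delta
    return (sum(observed_scores) + n_missing * assumed_missing_mean) / n_total


# delta = 0 reproduces the complete-case mean; larger deltas encode stronger MNAR.
for delta in (0, 5, 10):
    print(f"delta={delta:>2}: estimated mean = {mean_under_shift(delta):.2f}")
```

If the substantive conclusion holds across the plausible range of `delta`, the result is robust to the MNAR assumption; if it flips, that dependence should be reported.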
Question 2 Multiple Choice
You discover that three participants have their age recorded as '220'. What is the most appropriate first step?
A. Remove all three cases immediately to protect data integrity
B. Replace each value with the sample mean age
C. Verify the values against original records; correct if possible, flag for exclusion if not verifiable
D. Ignore them: three impossible values cannot materially affect a large sample
Answer: C
An impossible value is most likely a data entry error, but the responsible action is verification rather than reflexive deletion. The original questionnaire or data record may reveal the actual value. If verification is impossible, the case should be excluded with documentation. Replacing the value with the mean substitutes an invented number for the true one and artificially reduces variance. Ignoring impossible values, even in large samples, risks distorting distributions and violating the assumptions of downstream parametric tests.
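A range check is one simple way to surface such values for verification. The sketch below uses made-up records and an assumed plausible age range of 18 to 110; `flag_impossible_ages` is a hypothetical helper, and it only flags cases rather than deleting or changing them.

```python
# Hypothetical records; IDs and ages are made up for illustration.
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": 220},
    {"id": 3, "age": 58},
    {"id": 4, "age": 220},
    {"id": 5, "age": 41},
    {"id": 6, "age": 220},
]

# Assumed plausible range for adult participants; adjust to the study protocol.
AGE_MIN, AGE_MAX = 18, 110


def flag_impossible_ages(rows, low=AGE_MIN, high=AGE_MAX):
    """Return rows whose age falls outside [low, high] without altering the data."""
    return [row for row in rows if not (low <= row["age"] <= high)]


flagged = flag_impossible_ages(records)
for row in flagged:
    print(f"Check source record for participant {row['id']}: age={row['age']}")
```

The flagged IDs can then be checked against the original questionnaires before any correction or exclusion is recorded.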
Question 3 True / False
If less than 5% of values are missing, listwise deletion generally produces unbiased estimates.
T. True
F. False
Answer: False
The appropriateness of listwise deletion depends on the missingness mechanism, not the proportion of missing data. If data are MNAR — even if only 1% are missing — listwise deletion produces biased estimates because the excluded cases are systematically different from those retained. The 5% threshold is a rough guideline for when missingness is unlikely to be a practical problem under MCAR, not a guarantee against bias under any mechanism.
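A small simulation can make the point concrete. The sketch below assumes a simulated population of scores (mean 50, SD 10) and removes the same 5% of cases in two ways: at random (MCAR) and by dropping the highest scores (an extreme form of MNAR). All values are generated for illustration only.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population of scores; values are simulated, not real data.
scores = [random.gauss(50, 10) for _ in range(100_000)]
true_mean = statistics.fmean(scores)

# MCAR: each case has the same 5% chance of being missing.
mcar_kept = [s for s in scores if random.random() >= 0.05]

# MNAR: the highest 5% of scores are missing (missingness depends on the value).
cutoff = sorted(scores)[int(0.95 * len(scores))]
mnar_kept = [s for s in scores if s < cutoff]

# Bias of the listwise-deletion mean under each mechanism.
mcar_bias = statistics.fmean(mcar_kept) - true_mean
mnar_bias = statistics.fmean(mnar_kept) - true_mean
print(f"MCAR bias: {mcar_bias:+.3f}")
print(f"MNAR bias: {mnar_bias:+.3f}")
```

With the same proportion missing, the MCAR bias should sit near zero while the MNAR bias is about one point (roughly a tenth of a standard deviation) below the true mean, which shows that the mechanism, not the percentage, drives the bias.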
Question 4 True / False
Documenting every data preparation decision — what was found, what was done, and why — is essential for scientific reproducibility, not optional bookkeeping.
T. True
F. False
Answer: True
Data preparation decisions (which outliers were removed, how missing data were handled, which variables were transformed) directly affect statistical results and can alter conclusions. Without documentation, another researcher cannot reproduce the analysis and reviewers cannot evaluate whether decisions were reasonable or introduced bias. These decisions belong in the methods section of any publication — they are part of the analytical record, not pre-analysis housekeeping.
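One lightweight way to keep such a record is a structured decision log. The sketch below is only an illustration: the `PrepDecision` class and the two entries are hypothetical, and the three fields mirror the "what was found, what was done, and why" structure described above.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class PrepDecision:
    found: str   # what was found in the data
    action: str  # what was done about it
    reason: str  # why that action was chosen


# Hypothetical entries for illustration only.
log = [
    PrepDecision(
        found="3 cases with age recorded as 220",
        action="Checked original records; excluded cases that could not be verified",
        reason="Value is outside the possible range for human age",
    ),
    PrepDecision(
        found="Follow-up depression scores missing for some participants",
        action="Ran sensitivity analyses alongside the primary analysis",
        reason="Missingness is plausibly related to the unobserved scores",
    ),
]

# Serialize the log so it can be archived with the analysis and cited in the methods section.
print(json.dumps([asdict(entry) for entry in log], indent=2))
```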
Question 5 Short Answer
Why is it necessary to determine the mechanism of missingness (MCAR, MAR, or MNAR) before deciding how to handle missing data?
Model answer: Each mechanism has different implications for bias. Under MCAR, the missing cases are a random subsample, so listwise deletion is unbiased (only losing power). Under MAR, missingness is related to observed variables but not to the missing values themselves, so multiple imputation using those observed variables can restore unbiased estimates. Under MNAR, missingness is related to the unobserved value itself, and neither listwise deletion nor standard imputation is unbiased — any analysis ignoring the missingness is systematically skewed. Applying the wrong method can produce results that look complete and valid but are driven by who did not respond.
The key insight is that missing data are not just a nuisance with a default remedy. The mechanism determines whether the observed data are a representative sample of what you intended to measure. Treating all missingness the same (say, always using listwise deletion) can introduce systematic bias that inflates or deflates effect estimates, undermining the entire analysis.
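The MAR case can be illustrated with a simulation. In the sketch below, a fully observed binary covariate (`group`) predicts both the outcome and the chance of being missing; all numbers are simulated assumptions. The adjustment shown is a simple stratified mean, used here as a stand-in for multiple imputation rather than an implementation of it, because both rely on the same idea of using the observed covariate to correct the estimate.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

n = 50_000
# Hypothetical fully observed covariate: group 1 has higher outcomes on average.
group = [1 if random.random() < 0.5 else 0 for _ in range(n)]
outcome = [random.gauss(60 if g else 40, 10) for g in group]
true_mean = statistics.fmean(outcome)

# MAR: the chance of being missing depends only on the observed group (60% vs 10%).
observed = [random.random() >= (0.6 if g else 0.1) for g in group]

# Listwise deletion: average whatever outcomes remain.
complete_case_mean = statistics.fmean(
    [y for y, seen in zip(outcome, observed) if seen]
)

# Adjustment using the observed covariate: average within each group,
# then weight each group by its share of the full sample.
adjusted_mean = 0.0
for g in (0, 1):
    seen_in_group = [
        y for y, gi, seen in zip(outcome, group, observed) if gi == g and seen
    ]
    share = group.count(g) / n
    adjusted_mean += share * statistics.fmean(seen_in_group)

print(f"True mean:          {true_mean:.2f}")
print(f"Complete-case mean: {complete_case_mean:.2f}")
print(f"Adjusted mean:      {adjusted_mean:.2f}")
```

Under these assumptions the complete-case mean lands around 46 against a true mean near 50, while the adjusted mean recovers the true value closely. The same adjustment would not help under MNAR, where missingness depends on the unobserved outcome itself.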