A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Data Preparation, Screening, and Quality Assurance

College · Depth 115 in the knowledge graph

47 topics build on this

568 prerequisites beneath it

Core Idea

Before analysis, data must be checked for entry errors, missing values, outliers, and assumption violations. The missing data mechanism (missing completely at random, missing at random, or missing not at random) determines which handling methods are appropriate. Outliers require investigation: are they errors, genuine extreme values, or signs of violated assumptions? Documenting data cleaning ensures transparency and reproducibility.

How It's Best Learned

Conduct exploratory data analysis on a dataset: describe distributions, identify missing patterns, investigate outliers. Practice multiple imputation for missing data. Discuss how data preparation decisions can influence downstream results.

Common Misconceptions

- Data cleaning is optional if sample size is large
- Outliers should always be removed
- Missing data can be ignored if < 5%
- Transformation of variables is data manipulation

Explainer

Data analysis is only as trustworthy as the data it operates on — and raw data almost never arrives clean. Before running any statistical model, you need to understand what you actually have: how it was collected, where it might have gone wrong, and what decisions you made to handle its imperfections. This is data preparation and quality assurance, and it is not a formality — the choices made here can meaningfully change your conclusions.

Start with the basics: entry errors and range violations. A participant age recorded as 220, a Likert response of 9 on a 1–7 scale, or a reaction time of -200 ms is not plausible. Such values require verification against original records or flagging for exclusion. Then examine distributions: a variable that should be approximately normal but is heavily skewed might indicate a recording error, a floor or ceiling effect, or a genuine distributional feature that violates assumptions of downstream parametric tests. Plotting histograms and running descriptives (mean, median, range, skewness, kurtosis) is not busywork; it is your first look at the actual structure of the data.
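The range checks described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed workflow: the variable names (`age`, `likert_q1`, `rt_ms`), the plausibility limits, and the four example rows are all assumptions made up for the example, and the helper functions are hypothetical.

```python
import statistics

# Hypothetical plausibility limits per variable (assumed for illustration).
VALID_RANGES = {
    "age": (18, 100),
    "likert_q1": (1, 7),
    "rt_ms": (0, 5000),
}


def find_range_violations(rows, valid_ranges):
    """Return (row_index, variable, value) for every out-of-range value."""
    violations = []
    for i, row in enumerate(rows):
        for var, (low, high) in valid_ranges.items():
            value = row.get(var)
            if value is None:
                continue  # missing values are a separate screening step
            if not (low <= value <= high):
                violations.append((i, var, value))
    return violations


def describe(values):
    """Basic descriptives for a list of numbers, skipping missing entries."""
    clean = [v for v in values if v is not None]
    return {
        "n": len(clean),
        "mean": statistics.mean(clean),
        "median": statistics.median(clean),
        "min": min(clean),
        "max": max(clean),
    }


# Made-up records containing the implausible values mentioned in the text.
rows = [
    {"age": 34, "likert_q1": 5, "rt_ms": 420},
    {"age": 220, "likert_q1": 3, "rt_ms": 515},
    {"age": 27, "likert_q1": 9, "rt_ms": -200},
    {"age": 41, "likert_q1": None, "rt_ms": 610},
]

violations = find_range_violations(rows, VALID_RANGES)
age_summary = describe([r["age"] for r in rows])

for row_index, var, value in violations:
    print(f"row {row_index}: {var} = {value} is outside {VALID_RANGES[var]}")
print(age_summary)
```

The sketch only flags values; it does not delete them. In this toy data the single age of 220 pulls the mean far above the median, which is the kind of discrepancy the descriptives are meant to surface before any correction is decided.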

Missing data is where the methodological stakes rise. The key distinction is the *mechanism* of missingness. Missing completely at random (MCAR) means the probability of missingness is unrelated to any variable, observed or unobserved, so data are missing as if by random deletion. This is the least damaging case because listwise deletion (dropping incomplete cases) still produces unbiased estimates, just with reduced power. Missing at random (MAR) means missingness is related to observed variables but not to the missing values themselves. For example, men are more likely to skip depression items, but among men, those who skip don't differ systematically from those who respond. Under MAR, methods that condition on the observed variables, such as multiple imputation, can give valid estimates. Missing not at random (MNAR) is the most problematic: people with severe depression skip depression items precisely because they're severely depressed. Here, any analysis that ignores the missingness mechanism is potentially biased, and the problem cannot be fully solved from the observed data alone.
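A small simulation can make the three mechanisms concrete. The sketch below is illustrative only: the population (two equally sized groups with mean scores of 45 and 55), the missingness rates, and the cutoff of 60 are all invented assumptions. The MAR correction shown is simple reweighting by known group shares, used here as a stand-in for conditioning on observed variables; it is not multiple imputation.

```python
import random
import statistics


def simulate(n=20000, seed=0):
    rng = random.Random(seed)

    # Hypothetical population: group "m" averages 45, group "w" averages 55.
    people = []
    for _ in range(n):
        group = "m" if rng.random() < 0.5 else "w"
        score = rng.gauss(45 if group == "m" else 55, 10)
        people.append((group, score))
    true_mean = statistics.mean(s for _, s in people)

    # MCAR: every score has the same 30% chance of being missing.
    mcar = [s for _, s in people if rng.random() > 0.30]

    # MAR: missingness depends only on the observed group (50% vs. 10%).
    mar = [
        (g, s) for g, s in people
        if rng.random() > (0.50 if g == "m" else 0.10)
    ]

    # MNAR: missingness depends on the unobserved score itself.
    mnar = [
        s for _, s in people
        if rng.random() > (0.80 if s > 60 else 0.10)
    ]

    # Under MAR, within-group means are still unbiased, so weighting them by
    # the full-sample group shares recovers the overall mean.
    share_m = sum(1 for g, _ in people if g == "m") / n
    mean_m = statistics.mean(s for g, s in mar if g == "m")
    mean_w = statistics.mean(s for g, s in mar if g == "w")

    return {
        "true_mean": true_mean,
        "mcar_listwise": statistics.mean(mcar),
        "mar_listwise": statistics.mean(s for _, s in mar),
        "mar_adjusted": share_m * mean_m + (1 - share_m) * mean_w,
        "mnar_listwise": statistics.mean(mnar),
    }


result = simulate()
for name, value in result.items():
    print(f"{name:>14}: {value:.2f}")
```

Under these assumptions, the complete-case mean should stay close to the true mean under MCAR, drift toward the better-observed group under MAR until group is taken into account, and remain biased downward under MNAR, where nothing in the observed data identifies the size of the bias.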

Outliers require investigation, not reflexive deletion. An extreme value might be a data-entry error (delete or correct it), a legitimate unusual case (consider whether your research question applies to such cases), or an influential observation that reveals a model misspecification (investigate the model, not just the point). Running analyses with and without outliers and reporting both sets of results is often more informative than any single decision rule.

Similarly, variable transformations, such as taking the log of a skewed variable or standardizing variables before analysis, are not manipulations in the pejorative sense; they are adjustments that help satisfy model assumptions or put variables on a comparable scale. The test of whether a transformation is appropriate is whether it makes substantive sense and whether you declare it transparently in your methods section.

Every data preparation decision should be documented: what you found, what you did, and why. This documentation is not optional overhead. It is what separates reproducible science from analysis that cannot be audited or replicated.
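The advice above (flag rather than delete, report results both ways, record every decision) can be sketched as follows. The reaction times are made up, and the 1.5 × IQR fence is only one common flagging convention chosen here for illustration; the text does not prescribe a particular rule. The `decision_log` structure is likewise a hypothetical format. `statistics.quantiles` requires Python 3.8 or later.

```python
import math
import statistics


def iqr_fences(values, k=1.5):
    """Lower and upper fences at k * IQR beyond the quartiles (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up reaction times in milliseconds, with one extreme value.
rt_ms = [420, 515, 480, 610, 450, 530, 495, 470, 560, 3900]

low, high = iqr_fences(rt_ms)
flagged = [v for v in rt_ms if v < low or v > high]
kept = [v for v in rt_ms if low <= v <= high]

# Sensitivity analysis: the same summaries with and without flagged values.
sensitivity = {
    "mean_all": statistics.mean(rt_ms),
    "mean_without_flagged": statistics.mean(kept),
    "median_all": statistics.median(rt_ms),
    "median_without_flagged": statistics.median(kept),
}

# A log transform compresses the long right tail; all values must be positive.
log_rt = [math.log(v) for v in rt_ms]

# Record what was found, what was done, and why.
decision_log = [
    {
        "step": "flag outliers",
        "found": f"{len(flagged)} value(s) outside [{low:.1f}, {high:.1f}]",
        "action": "flagged, not deleted; results reported both ways",
        "reason": "cannot yet tell an entry error from a genuine slow response",
    },
    {
        "step": "log transform",
        "found": "right-skewed reaction times",
        "action": "analysed log(rt_ms) alongside the raw values",
        "reason": "reduce skew before parametric tests",
    },
]

print(sensitivity)
for entry in decision_log:
    print(entry)
```

In this toy data the mean changes substantially when the flagged value is set aside while the median barely moves, which is the kind of contrast a with-and-without report is meant to expose.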

Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Dividing Integers → Unit Rates → Proportions → Percent Concept → Converting Between Fractions, Decimals, and Percents → Operations with Rational Numbers → Two-Step Equations → Solving Multi-Step Equations → Equations with Variables on Both Sides → Angle Pairs: Complementary, Supplementary, and Vertical → Parallel Lines and Transversals → Corresponding Angles → Alternate Interior Angles → Triangle Angle Sum Theorem → Exterior Angle Theorem → 
Triangle Inequality Theorem → Similar Triangles: AA Similarity → Similar Triangles: SSS and SAS Similarity → Proportions in Similar Triangles → Right Triangle Trigonometry Introduction → Sine, Cosine, and Tangent Ratios → Trigonometric Ratios Review → Radian Measure → Converting Between Degrees and Radians → The Unit Circle → Graphing Sine and Cosine → Graphing Tangent and Reciprocal Trigonometric Functions → Derivatives of Trigonometric Functions → Antiderivatives → Indefinite Integrals → Basic Integration Rules → Riemann Sums → Definite Integral Definition → Probability Density Functions and Continuous Distributions → Cumulative Distribution Functions → Continuous Random Variables → Probability Density Functions → Expected Value → Weak Law of Large Numbers → Probability Axioms and Rules → Conditional Probability → Conditional Distributions → Bivariate Normal Distribution → Normal Distribution → Standard Normal Distribution and Z-Scores → Hypothesis Testing Fundamentals → Experimental Research Design → Control and Experimental Groups → Random Assignment → Confounding Variables and Internal Validity → Blinding and Demand Characteristics → Validity in Psychological Measurement → Inferential Statistics in Psychology → Effect Size and Statistical Power → Sample Size Determination in Research Planning → Literature Review and Research Synthesis → Hypothesis Construction: Directional and Nondirectional Predictions → Operationalizing Independent and Dependent Variables → Construct Definition and Measurement Development → Construct Validity and Measurement Validity → Construct Validity and Operationalization of Psychological Constructs → Variables: Definition, Operationalization, and Measurement → Systematic Observation, Behavioral Coding, and Analysis → Data Preparation, Screening, and Quality Assurance

Longest path: 116 steps · 568 total prerequisite topics

Prerequisites (3)

- Survey Design, Construction, and Administration (soft)
- Systematic Observation, Behavioral Coding, and Analysis (soft)
- Missing Data Mechanisms, Patterns, and Handling Methods (soft)

Leads To (1)

- Descriptive Statistics and Data Visualization (hard)