
A topic in the Open Knowledge Graph — a free, open map of 15,290 topics and the order to learn them in.

Query Cardinality and Selectivity Estimation

College · Depth 89 in the knowledge graph

1 topic builds on this

492 prerequisites beneath it



Core Idea

Cardinality estimation predicts how many rows each query operation produces, and the optimizer uses those predictions to choose a plan. Selectivity is the fraction of rows passing a condition (e.g., `age > 18` might have selectivity 0.3, so a 1,000-row table yields an estimated 300 rows). The optimizer combines the estimates for individual operations using data distribution statistics. Accurate estimates are critical for good plan selection: errors of 2-3x are common, but errors of 100x or more can cause terrible plans.

Explainer

You already know that the query optimizer chooses between different execution plans — sequential scans, index lookups, various join algorithms — to find the fastest way to answer a query. But how does it decide? The answer is cardinality estimation: the optimizer's prediction of how many rows will flow through each step of a plan. If it expects a filter to return 10 rows, an index lookup makes sense. If it expects 10 million rows, a full table scan is cheaper. The entire cost model rests on these row-count predictions.
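As a rough illustration of why the row-count prediction drives this choice, the sketch below compares two toy cost formulas. The function name `choose_access_path`, the rows-per-page figure, and the cost constants are all assumptions made for this example; real optimizers use far more detailed cost models.

```python
def choose_access_path(table_rows, estimated_rows,
                       rows_per_page=100,      # assumed page density (illustrative)
                       seq_page_cost=1.0,      # assumed cost of reading one page in order
                       random_page_cost=4.0):  # assumed cost of one random page fetch
    """Pick the cheaper access path under a deliberately simple cost model."""
    # A sequential scan reads every page of the table once, in order.
    seq_cost = (table_rows / rows_per_page) * seq_page_cost
    # An index lookup is modeled as one random page fetch per matching row.
    index_cost = estimated_rows * random_page_cost
    return "index lookup" if index_cost < seq_cost else "sequential scan"


if __name__ == "__main__":
    table_rows = 10_000_000
    for estimate in (10, 10_000_000):
        print(estimate, "->", choose_access_path(table_rows, estimate))
```

The only input that changes between the two calls is the estimated row count, which is why a wrong estimate can flip the decision.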

Selectivity is the fundamental unit of estimation. It represents the fraction of rows that satisfy a given condition. A filter like `status = 'active'` on a table with 1 million rows might have selectivity 0.4, meaning the optimizer estimates 400,000 rows will pass. For equality conditions on columns with uniform distribution, selectivity is simply 1/NDV (number of distinct values). For range conditions like `age > 30`, selectivity depends on knowing how values are distributed — which is why databases collect statistics like histograms, most-common-value lists, and null fractions.
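The two estimation rules above can be sketched as follows. This is a minimal illustration, not any particular database's implementation: the function names are hypothetical, and the range estimate assumes an equi-depth histogram (every bucket holds the same number of rows) with values spread uniformly inside each bucket.

```python
from bisect import bisect_right
from typing import Sequence


def equality_selectivity(ndv: int) -> float:
    """Selectivity of `col = value` assuming all distinct values are equally common."""
    if ndv <= 0:
        raise ValueError("ndv must be positive")
    return 1.0 / ndv


def range_selectivity_gt(bounds: Sequence[float], value: float) -> float:
    """Selectivity of `col > value` from sorted equi-depth bucket boundaries.

    `bounds` has len(bounds) - 1 buckets; each is assumed to hold an equal
    share of the rows, distributed uniformly between its two boundaries.
    """
    n_buckets = len(bounds) - 1
    if value < bounds[0]:
        return 1.0  # every row is above the value
    if value >= bounds[-1]:
        return 0.0  # no row is above the value
    i = bisect_right(bounds, value) - 1          # index of the bucket containing value
    lo, hi = bounds[i], bounds[i + 1]
    fraction_of_bucket_above = (hi - value) / (hi - lo)
    buckets_fully_above = n_buckets - 1 - i
    return (fraction_of_bucket_above + buckets_fully_above) / n_buckets


if __name__ == "__main__":
    table_rows = 1_000_000
    age_bounds = [0, 20, 30, 45, 90]  # made-up boundaries for four equal-sized buckets
    print(round(table_rows * equality_selectivity(4)))                 # one of 4 distinct values
    print(round(table_rows * range_selectivity_gt(age_bounds, 30)))    # age > 30
```

With these made-up boundaries, `age > 30` covers exactly the top two of four buckets, so the estimate is half the table.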

The real challenge arises when the optimizer must combine selectivity estimates across multiple conditions. For `WHERE age > 30 AND city = 'Denver'`, the standard assumption is independence: multiply the individual selectivities together. If `age > 30` has selectivity 0.6 and `city = 'Denver'` has selectivity 0.05, the combined estimate is 0.6 × 0.05 = 0.03, or 3% of rows. This independence assumption is often wrong (age and city may correlate), but without multi-column statistics it is the best the optimizer can do. Correlated predicates are one of the most common sources of severe estimation errors.
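To make the effect of correlation concrete, here is a small sketch on invented data. The 100-row table is fabricated so that the single-column selectivities match the 0.6 and 0.05 used above, while every Denver row also happens to satisfy `age > 30`. The helper names are hypothetical.

```python
# Invented (age, city) rows: all 5 Denver residents are over 30.
rows = [(40, "Denver")] * 5 + [(40, "Boston")] * 55 + [(20, "Boston")] * 40


def is_over_30(row):
    return row[0] > 30


def is_denver(row):
    return row[1] == "Denver"


def selectivity(table, predicate):
    """Fraction of rows in `table` that satisfy `predicate`."""
    return sum(1 for row in table if predicate(row)) / len(table)


sel_age = selectivity(rows, is_over_30)    # 60 of 100 rows
sel_city = selectivity(rows, is_denver)    # 5 of 100 rows

# Independence assumption: multiply the per-column selectivities.
independent_estimate = len(rows) * sel_age * sel_city

# Ground truth: count rows that satisfy both predicates at once.
actual = sum(1 for row in rows if is_over_30(row) and is_denver(row))

if __name__ == "__main__":
    print("estimated rows:", round(independent_estimate))
    print("actual rows:   ", actual)
```

Here the independence estimate comes out to about 3 rows while 5 rows actually match. The gap is small in this toy table, but the same mechanism produces much larger errors when the correlation is stronger or more predicates are combined.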

Estimation errors compound through the plan. A 3x overestimate at a filter feeds into the join above it, which uses that inflated number to pick a join algorithm, perhaps choosing a hash join when a nested-loop join on a small result set would have been far faster. Because each operator's estimate is built on the estimates beneath it, errors at several levels tend to multiply rather than add. This cascading effect explains why a single bad selectivity estimate can make a query run 100x slower than optimal. When you encounter a mysteriously slow query, comparing the optimizer's estimated row counts with the actual row counts (via `EXPLAIN ANALYZE` in PostgreSQL, or your engine's equivalent) is often the fastest path to diagnosis. The operator where estimated and actual rows diverge sharply points directly to the broken assumption.
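One way to turn that comparison into a number is to compute, for each plan operator, the factor by which the estimate is off in either direction. The sketch below assumes you have already copied the estimated and actual row counts out of your engine's plan output into a list of dictionaries; the field names, the function names, and the sample plan are all hypothetical.

```python
def misestimation_factor(estimated_rows, actual_rows):
    """How many times off the estimate is, regardless of direction (always >= 1)."""
    # Clamp to 1 so that zero-row operators do not cause division by zero.
    est = max(estimated_rows, 1)
    act = max(actual_rows, 1)
    return max(est / act, act / est)


def worst_estimate(plan_nodes):
    """Return the plan node whose estimate is furthest from its actual row count."""
    return max(
        plan_nodes,
        key=lambda n: misestimation_factor(n["estimated_rows"], n["actual_rows"]),
    )


if __name__ == "__main__":
    # Invented numbers standing in for values read off a real plan.
    plan = [
        {"name": "scan orders",       "estimated_rows": 300, "actual_rows": 100},
        {"name": "filter customers",  "estimated_rows": 10,  "actual_rows": 1000},
        {"name": "join",              "estimated_rows": 500, "actual_rows": 450},
    ]
    node = worst_estimate(plan)
    factor = misestimation_factor(node["estimated_rows"], node["actual_rows"])
    print(node["name"], "is off by a factor of", factor)
```

Treating over- and underestimates symmetrically is a design choice for this sketch: either direction can lead the optimizer to a poor plan, so the ranking looks only at the size of the gap.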


Prerequisite Chain

Understanding Zero → The Number Zero → Counting to Five → Counting to 10 → Counting to 20 → Counting a Set of Objects Up to 20 → Cardinality: The Last Number Counted → Matching Numerals to Quantities → Subitizing Small Quantities → Addition Within 10 → Number Bonds to 10 → Addition Within 20 → Doubles and Near Doubles → Doubles Facts Within 10 → Near Doubles Facts Within 20 → Mental Math Strategies for Addition → Mental Math: Adding and Subtracting Tens → Addition Within 100 → Repeated Addition as Multiplication → Multiplication as Equal Groups → Multiplication: Arrays → Basic Multiplication Facts (0s, 1s, 2s, 5s, 10s) → Multiplication Facts Within 100 → Division as Equal Sharing → Division as Grouping (Measurement Division) → Division: Grouping (Repeated Subtraction) Model → Division: Fair Sharing Model → Division as Equal Sharing → Division as Grouping → Basic Division Facts → Division Facts Within 100 → Multiplication and Division Fact Families → Relationship Between Multiplication and Division → Division Facts as Inverse of Multiplication → Remainders and Quotients in Division → Division Word Problems → Multi-Step Word Problems → Solving Multi-Step Word Problems → Multiplication Word Problems → Division Word Problems → Introduction to Long Division → Factors and Multiples → Prime and Composite Numbers → Equivalent Fractions → Relating Fractions and Decimals → Decimal Place Value → Integers and the Number Line → Comparing and Ordering Integers → Absolute Value → Adding Integers → Subtracting Integers → Multiplying Integers → Introduction to Exponents → Order of Operations → Integer Order of Operations → Variable Expressions → The Distributive Property → Variables and Expressions Review → Introduction to Polynomials → Adding and Subtracting Polynomials → Multiplying Polynomials → Factorial → Permutations → Combinations → Counting Principles: Addition and Multiplication Rules → Introduction to Graph Theory → Propositional Logic Foundations → Logical Equivalences → 
Boolean Algebra → Boolean Type and Truth Values → Comparison Operators and Boolean Tests → Logical Operators and Boolean Algebra → Conditional Statements → Defining and Calling Functions → Functions: Decomposing Problems → Function Parameters and Argument Passing → Return Values → Variable Scope → Introduction to Classes → Objects and Instances → Methods and Attributes → Algorithm Design Basics → Asymptotic Notation: Big-O, Big-Omega, Big-Theta → Big-O Notation and Complexity Analysis → Time and Space Complexity → Binary Search → Binary Search Trees → B-Tree Indexes → Query Optimization → Query Cardinality and Selectivity Estimation

Longest path: 90 steps · 492 total prerequisite topics

Prerequisites (2)

Query Optimization (hard)
SQL: WHERE Clause and Filtering (hard)

Leads To (1)

Table Statistics, Histograms, and Column Statistics (hard)