Query Cardinality and Selectivity Estimation

College Depth 66 in the knowledge graph I know this Set as goal
Unlocks 1 downstream topic
cardinality selectivity estimation cost-model

Core Idea

Cardinality estimation predicts how many rows result from query operations to guide optimizer decisions. Selectivity is the fraction of rows passing a condition (e.g., age > 18 might have selectivity 0.3). The optimizer combines estimates from individual operations and uses data distribution statistics. Accurate estimates are critical for good plan selection; errors of 2-3x are common but errors of 100x+ cause terrible plans.

Explainer

You already know that the query optimizer chooses between different execution plans — sequential scans, index lookups, various join algorithms — to find the fastest way to answer a query. But how does it decide? The answer is cardinality estimation: the optimizer's prediction of how many rows will flow through each step of a plan. If it expects a filter to return 10 rows, an index lookup makes sense. If it expects 10 million rows, a full table scan is cheaper. The entire cost model rests on these row-count predictions.

Selectivity is the fundamental unit of estimation. It represents the fraction of rows that satisfy a given condition. A filter like `status = 'active'` on a table with 1 million rows might have selectivity 0.4, meaning the optimizer estimates 400,000 rows will pass. For equality conditions on columns with uniform distribution, selectivity is simply 1/NDV (number of distinct values). For range conditions like `age > 30`, selectivity depends on knowing how values are distributed — which is why databases collect statistics like histograms, most-common-value lists, and null fractions.

The real challenge arises when the optimizer must combine selectivity estimates across multiple conditions. For `WHERE age > 30 AND city = 'Denver'`, the standard assumption is independence: multiply the individual selectivities together. If age > 30 has selectivity 0.6 and city = 'Denver' has selectivity 0.05, the combined estimate is 0.03 — 3% of rows. This independence assumption is often wrong (age and city may correlate), but without multi-column statistics it is the best the optimizer can do. Correlated predicates are one of the most common sources of severe estimation errors.

Estimation errors compound through the plan. A 3x overestimate at a filter feeds into the join above it, which uses that inflated number to pick a join algorithm — perhaps choosing a hash join when a nested-loop join on a small result set would have been far faster. This cascading effect explains why a single bad selectivity estimate can make a query run 100x slower than optimal. When you encounter a mysteriously slow query, examining the optimizer's cardinality estimates (via EXPLAIN ANALYZE or equivalent) and comparing them to actual row counts is often the fastest path to diagnosis. The gap between estimated and actual rows points directly to the broken assumption.

Practice Questions 5 questions

Prerequisite Chain

Counting to 10Counting to 20Understanding ZeroThe Number ZeroCounting to FiveOne-to-One CorrespondenceCombining Small Groups Within 5Addition Within 10Addition Within 20Two-Digit Addition Without RegroupingTwo-Digit Addition with RegroupingAddition Within 100Repeated Addition as MultiplicationMultiplication Facts Within 100Division as Equal SharingDivision as Grouping (Measurement Division)Division: Grouping (Repeated Subtraction) ModelDivision: Fair Sharing ModelDivision as Equal SharingDivision as GroupingBasic Division FactsDivision Facts Within 100Two-Digit by One-Digit DivisionDivision with RemaindersRemainders and Quotients in DivisionDivision Word ProblemsIntroduction to Long DivisionFactors and MultiplesPrime and Composite NumbersEquivalent FractionsRelating Fractions and DecimalsDecimal Place ValueReading and Writing DecimalsComparing and Ordering DecimalsAdding and Subtracting DecimalsMultiplying DecimalsDividing DecimalsDividing FractionsMixed Number ArithmeticOrder of OperationsInteger Order of OperationsVariable ExpressionsCombining Like TermsOne-Step EquationsTwo-Step EquationsSolving Multi-Step EquationsEquations with Variables on Both SidesLiteral EquationsSlope-Intercept FormPoint-Slope FormWriting Linear EquationsParallel and Perpendicular Line SlopesGraphing Linear EquationsPiecewise FunctionsStep FunctionsComposition of FunctionsInverse FunctionsRadical Functions and GraphsRational ExponentsExponential Functions and GraphsLogarithms IntroductionTime and Space ComplexityBinary SearchBinary Search TreesB-Tree IndexesQuery OptimizationQuery Cardinality and Selectivity Estimation

Longest path: 67 steps · 364 total prerequisite topics

Prerequisites (2)

Leads To (1)