Questions: Query Cardinality and Selectivity Estimation
5 questions to test your understanding
Score: 0 / 5
Question 1 Multiple Choice
A query optimizer estimates that WHERE age > 30 AND city = 'Denver' will return 3,000 rows (using independence: selectivity 0.6 × 0.05 on a 100,000-row table). The actual result is 200 rows because age and city are correlated. What plan failure is most likely?
AThe optimizer will generate a syntax error because correlated columns require special syntax
BThe optimizer will skip the city filter entirely to reduce computation
CThe optimizer will choose a join algorithm and scan strategy suited for 3,000 rows — likely a hash join or full scan — when a nested-loop join on the actual 200-row result would be far faster
DThe optimizer will correctly detect the correlation and rewrite the query
A 15× overestimate (3,000 vs. 200 rows) causes the optimizer to select algorithms designed for large intermediate results. Hash joins have good throughput for large sets but high setup cost; nested-loop joins are optimal for small result sets. The optimizer, believing 3,000 rows are coming, picks the wrong algorithm. Without multi-column statistics, the optimizer cannot detect the correlation — it only has per-column selectivity estimates and assumes independence. This is a classic example of how a single estimation error cascades into a bad plan.
Question 2 Multiple Choice
A column has 200 distinct values uniformly distributed across a 1,000,000-row table. What selectivity does the optimizer assign to an equality condition on this column?
A0.5 — equality conditions match about half the rows on average
C1.0 — every row matches some value, so selectivity is 1
DUnknown — selectivity cannot be calculated without a histogram
For equality conditions on uniformly distributed columns, selectivity = 1/NDV = 1/200 = 0.005. With 1,000,000 rows, the optimizer estimates 5,000 matching rows. This is the baseline formula. Non-uniform distributions require histograms or most-common-value lists to refine the estimate — if 90% of rows have one value, a uniform assumption drastically underestimates that value's selectivity and overestimates all others.
Question 3 True / False
When estimating the combined selectivity of multiple filter conditions (e.g., WHERE a = 1 AND b = 2), databases multiply the individual selectivities together. This is generally accurate.
TTrue
FFalse
Answer: False
The independence assumption — multiply individual selectivities — is standard practice but is only accurate when the columns are truly uncorrelated. Many real-world columns correlate: age and retirement status, city and state, product category and price range. When columns correlate, the actual combined selectivity may be much higher or lower than the product predicts. Correlated predicates are one of the most common causes of severe cardinality estimation errors and the query plan failures that follow.
Question 4 True / False
A single large cardinality estimation error at an early stage of a query execution plan can cause dramatically poor performance for the entire query, including join strategy and index selection decisions downstream.
TTrue
FFalse
Answer: True
Cardinality estimates cascade through the plan tree. An overestimate at a filter makes the join above it receive an inflated row count, causing the optimizer to choose a join algorithm appropriate for large inputs (hash join) rather than the optimal one for a small result (nested-loop). Index-vs-full-scan decisions are similarly driven by estimated row counts. A single bad estimate propagates and amplifies — a 100× error at a filter can easily produce a query that runs 1,000× slower than the optimal plan.
Question 5 Short Answer
When diagnosing an unexpectedly slow SQL query, why is comparing 'estimated rows' to 'actual rows' in the query execution plan often the fastest path to finding the root cause?
Think about your answer, then reveal below.
Model answer: Because every plan decision — join algorithm, index vs. scan, join order — is driven by the optimizer's cardinality estimates. A large gap between estimated and actual rows at a specific plan node directly identifies which assumption was wrong and which operation is misconfigured. The location of the gap points to stale statistics, a correlated predicate being treated as independent, or a skewed distribution the optimizer didn't know about.
EXPLAIN ANALYZE (or equivalent) shows estimated vs. actual row counts for every node in the plan tree. A gap of 10× or more at a node is a strong signal that the statistics or selectivity model for that node is broken. Once you identify the broken node, the fix is usually specific: refresh table statistics with ANALYZE, add multi-column statistics for correlated columns, add an index hint, or rewrite the query to give the optimizer better information. Without this diagnostic, you'd be guessing which part of the query is slow.