Questions: Table Statistics, Histograms, and Column Statistics
5 questions to test your understanding
Score: 0 / 5
Question 1 Multiple Choice
A 'city' column in a 1,000,000-row table has 500 distinct values and contains 'New York' for 40% of rows. The optimizer assumes uniform distribution. How will it estimate the row count for WHERE city = 'New York'?
A400,000 rows — it recognizes 'New York' as the dominant value
B2,000 rows — dividing total rows by the number of distinct values
C1,000,000 rows — it defaults to full table scan estimates
D0 rows — no statistics are available without an explicit ANALYZE
With uniform distribution assumed and 500 distinct values, the optimizer estimates 1/500 × 1,000,000 = 2,000 rows for any specific city. The true count is 400,000 — two orders of magnitude off. This misestimate has real consequences: the optimizer would likely choose an index scan (efficient for 2,000 rows) when a sequential scan would be far faster for 400,000 rows. This is exactly why histograms and MFV lists exist — to capture the actual distribution instead of assuming uniformity.
Question 2 Multiple Choice
For a highly skewed column where a few values account for most rows, an equi-depth histogram provides better selectivity estimates than an equi-width histogram. Why?
AEqui-depth uses less memory, leaving room for more precise per-value statistics
BEqui-depth adjusts bucket boundaries so popular ranges get narrower buckets with more precise estimates, concentrating precision where data is dense
CEqui-depth captures the exact frequencies of the most common values, eliminating estimation error for those values
DEqui-depth requires no maintenance after data changes, while equi-width must be rebuilt after every insert
Equi-depth (equi-height) histograms place bucket boundaries so each bucket contains roughly the same number of rows. In a skewed distribution, the dense ranges get many narrow buckets — high resolution where it matters most. Equi-width divides the value range into equal intervals, so a skewed distribution packs most rows into a few wide buckets, yielding crude estimates for the popular ranges. Note that MFV (most-frequent-value) lists, not equi-depth histograms, capture exact counts for top values.
Question 3 True / False
When a query plan suddenly degrades after a large bulk insert, stale statistics are one of the first things worth checking.
TTrue
FFalse
Answer: True
Statistics describe the data distribution at the time they were collected. After a large bulk insert, row counts, value distributions, and selectivities may have changed dramatically. If the optimizer is still consulting pre-insert statistics, it may choose a plan that was optimal for the old data but performs poorly on the new data — for example, using an index that is no longer selective, or choosing a nested-loop join when a hash join would be faster. Running ANALYZE (or equivalent) after bulk operations is a standard DBA practice.
Question 4 True / False
Collecting exact column statistics in a database requires mainly a brief metadata lookup — the system tracks distributions automatically without scanning the table.
TTrue
FFalse
Answer: False
Exact statistics require reading and analyzing the actual data values, which means a full table scan. For large tables, this is expensive. That is why databases use sampling — reading a random subset of pages or rows and extrapolating. PostgreSQL's ANALYZE, for example, samples a configurable number of rows per column. Sampling is fast enough to run regularly but introduces sampling error. The tradeoff is speed of collection versus precision of estimates.
Question 5 Short Answer
Why do databases use sampling rather than full-table scans to build statistics, and what is the key operational risk of this approach?
Think about your answer, then reveal below.
Model answer: Sampling reads only a fraction of the table (e.g., a few thousand rows) and extrapolates column distributions from that subset. This makes statistics collection fast enough to run regularly without heavily impacting the system. The key operational risk is that statistics become stale as data changes: inserts, deletes, and updates shift the actual distribution away from what the statistics describe, causing the optimizer to make poor plan choices based on an outdated picture of the data.
The staleness risk is the core operational concern. High-churn tables — those with frequent inserts or deletes — may need frequent ANALYZE runs, and most databases offer auto-analyze features. But in practice, unexpected query regressions after bulk operations or schema migrations often trace back to stale statistics. A DBA's first question when a previously fast query becomes slow is often: 'When were statistics last collected?'