Query Optimization

College Depth 65 in the knowledge graph I know this Set as goal
Unlocks 3 downstream topics
query optimization cost-based optimization rule-based optimization join ordering statistics

Core Idea

Query optimization is the process of automatically selecting the most efficient execution plan for a SQL query. Rule-based optimization applies algebraic equivalences (push selections early, project away unused columns) to reduce intermediate result sizes. Cost-based optimization estimates the I/O and CPU cost of candidate plans using table statistics (histograms, row counts, distinct value counts) and selects the minimum-cost plan. Join order selection is especially important — there are n!/2 distinct join orders for n tables, so optimizers use dynamic programming or greedy heuristics to search this space efficiently.

How It's Best Learned

Experiment with manual query rewrites and compare EXPLAIN outputs: move filters earlier, avoid function calls on indexed columns, rewrite NOT IN as NOT EXISTS. Run ANALYZE to refresh statistics and see how plans change.

Common Misconceptions

Explainer

When you write a SQL query, you describe *what* data you want — not *how* to retrieve it. The database's query optimizer bridges that gap. From your prerequisite work with query execution plans, you know that a single SQL statement can be executed in many different ways: scanning a full table, using an index, joining tables in different orders, or applying filters at different stages. The optimizer's job is to evaluate these alternatives and pick the one with the lowest estimated cost.

There are two broad optimization strategies. Rule-based optimization applies algebraic transformations that are almost always beneficial. For example, pushing a WHERE filter down so it runs before a join reduces the number of rows flowing into the join — a technique called predicate pushdown. Similarly, projecting away columns you don't need means less data to shuffle between operations. These rules come from the relational algebra equivalences you've studied: since the algebra defines when two expressions produce the same result, the optimizer can freely substitute one form for another. Rule-based rewrites are cheap to apply and form the first pass of most optimizers.

Cost-based optimization goes further by estimating the actual resource cost of candidate plans. The optimizer maintains table statistics — row counts, histograms of column value distributions, and distinct value counts — and uses these to estimate how many rows each operation will produce. For instance, if a table has 10 million rows and a histogram shows that 2% match the filter `status = 'active'`, the optimizer estimates 200,000 rows after filtering and can decide whether an index scan or a sequential scan is cheaper. This is where your knowledge of B-tree and hash indexes matters: the optimizer chooses an index scan only when selectivity is high enough that reading index pages plus random table lookups costs less than a single sequential pass.

The hardest sub-problem is join ordering. If your query joins five tables, there are 5!/2 = 60 possible orderings, each producing different intermediate result sizes. For ten tables, the number explodes to over 1.8 million. Optimizers use dynamic programming to find optimal orderings for small join counts and fall back to greedy heuristics for larger queries. The key insight is that the cheapest overall plan usually minimizes intermediate result sizes — joining the most selective tables first keeps the data flowing through subsequent joins as small as possible. When the optimizer chooses poorly (often due to stale statistics after a bulk load), running ANALYZE to refresh statistics is usually the first and most effective fix.

Practice Questions 5 questions

Prerequisite Chain

Counting to 10Counting to 20Understanding ZeroThe Number ZeroCounting to FiveOne-to-One CorrespondenceCombining Small Groups Within 5Addition Within 10Addition Within 20Two-Digit Addition Without RegroupingTwo-Digit Addition with RegroupingAddition Within 100Repeated Addition as MultiplicationMultiplication Facts Within 100Division as Equal SharingDivision as Grouping (Measurement Division)Division: Grouping (Repeated Subtraction) ModelDivision: Fair Sharing ModelDivision as Equal SharingDivision as GroupingBasic Division FactsDivision Facts Within 100Two-Digit by One-Digit DivisionDivision with RemaindersRemainders and Quotients in DivisionDivision Word ProblemsIntroduction to Long DivisionFactors and MultiplesPrime and Composite NumbersEquivalent FractionsRelating Fractions and DecimalsDecimal Place ValueReading and Writing DecimalsComparing and Ordering DecimalsAdding and Subtracting DecimalsMultiplying DecimalsDividing DecimalsDividing FractionsMixed Number ArithmeticOrder of OperationsInteger Order of OperationsVariable ExpressionsCombining Like TermsOne-Step EquationsTwo-Step EquationsSolving Multi-Step EquationsEquations with Variables on Both SidesLiteral EquationsSlope-Intercept FormPoint-Slope FormWriting Linear EquationsParallel and Perpendicular Line SlopesGraphing Linear EquationsPiecewise FunctionsStep FunctionsComposition of FunctionsInverse FunctionsRadical Functions and GraphsRational ExponentsExponential Functions and GraphsLogarithms IntroductionTime and Space ComplexityBinary SearchBinary Search TreesB-Tree IndexesQuery Optimization

Longest path: 66 steps · 362 total prerequisite topics

Prerequisites (5)

Leads To (2)