How a cost-based optimizer picks a query plan

The database estimates what each possible plan would cost, then runs the cheapest one.

The idea

Imagine you need one specific name in a phone book. You could read every entry from the front (a full table scan), or you could jump straight to the right page using the alphabetical tabs (an index seek). For a single name, the tabs win easily. But if you wanted everyone in the book, flipping back and forth between tabs and pages would be slower than just reading straight through.

A cost-based optimizer makes exactly this call for every query. The same SQL can run many ways — different access paths, join orders, join algorithms. Using table statistics (row counts, how selective a predicate is, which indexes exist), it estimates the cost — in rows touched and I/O — of each candidate plan, and picks the lowest one. It never runs them to find out; it estimates from stats and commits to one.

Predicate selectivity:

Pick a predicate and select Optimize to compare the two plans.

How it works

The optimizer enumerates candidate plans, costs each from statistics, and keeps the cheapest. A simplified cost model charges roughly one unit per row a plan must touch — a full scan touches every row, while an index seek touches only the rows the predicate is estimated to match (plus a few cheap index-tree hops). The estimate comes straight from the histogram and row count, never from running the query.

-- Same query, two candidate access paths the planner considers:
SELECT * FROM users WHERE id = 42;

-- Planner estimates each from table statistics:
--   total_rows           = 10000      (from pg_class / stats)
--   estimated_matches    = 1          (selectivity of id = 42)
--
-- cost(full scan)  = total_rows               = 10000 rows read
-- cost(index seek) = index_hops + matches     =    3 + 1 = 4 rows read
--
-- choose plan with min(cost)  ->  index seek wins (4 < 10000)

EXPLAIN SELECT * FROM users WHERE id = 42;
--  Index Scan using users_pkey on users  (cost=0.29..8.30 rows=1)
--  -- planner chose the seek; "rows=1" is the selectivity estimate

EXPLAIN SELECT * FROM users WHERE active = true;
--  Seq Scan on users  (cost=0.00..170.00 rows=9000)
--  -- low selectivity: a seek would touch almost everything,
--  --  so the full scan is now the cheaper plan

Trade-offs

Access path	When it wins	Rough cost model
Index seek	Highly selective predicate (few matching rows), e.g. a primary-key or unique lookup	tree hops + matched rows; cheap when matches are few
Full table scan	Low selectivity (most rows match) or no usable index; reading sequentially beats random jumps	all rows, read sequentially (cheap per row)
Index scan (range)	A range or sorted read where index order matches; moderate selectivity	matched rows + per-row lookups (random I/O can dominate)

The crossover point is selectivity. As estimated matches climb toward the table size, the per-row random-access overhead of the seek overtakes the cheap sequential read of a scan, and the optimizer flips its choice.

Watch out for

Stale statistics. The optimizer trusts its stats. If a table grew from 1,000 to 10,000,000 rows but stats were never refreshed, it may estimate a tiny result and pick a seek that actually touches millions of rows. Keep ANALYZE / auto-stats current.
Parameter sniffing and non-selective predicates. A plan cached for id = 42 (one row) can be reused for status = 'active' (most rows). The seek that was perfect for the first is a disaster for the second. Watch for one plan serving wildly different selectivities.
Low-cardinality columns. An index on a column with two values (is_deleted) rarely helps — any single value still matches roughly half the table, so the optimizer correctly prefers a scan. Indexing such columns adds write cost for little read benefit.
Functions on indexed columns. WHERE LOWER(email) = 'x' or WHERE id + 0 = 42 hides the column behind a function, so the plain index can't be used and you fall back to a scan. Index the expression itself, or rewrite to leave the column bare.
Estimate-versus-actual row skew. The chosen plan is only as good as the estimate. Correlated predicates and skewed data fool the row-count math; compare EXPLAIN (estimated) against EXPLAIN ANALYZE (actual) to spot where the optimizer guessed wrong.

Worked example

Take SELECT * FROM users WHERE id = 42 against a 12-row sample of a 10,000-row table. With the high-selectivity predicate, the optimizer reads its stats: 10,000 total rows, estimated 1 match. The full-scan plan would walk every row — its cost counter climbs row by row to 12 in the sample (10,000 in reality). The index-seek plan instead hops down the B-tree (a handful of cheap node visits) and lands directly on row 42, touching just that one matching row. Costs compared: seek ≈ 4, scan = 10,000. The optimizer rings the seek as the winner and runs it.

Now switch the predicate to low selectivity (active = true, matching most rows). The seek would have to jump to nearly every row through the index one at a time — its estimated cost balloons past the scan, which reads the same rows but sequentially and cheaply. Same query shape, same table, but the cheaper plan is now the full scan. That flip is the whole point: the optimizer re-decides from the numbers, it does not have a favourite.

Check yourself

1. Your query WHERE id = 42 suddenly runs a slow full scan even though an index on id exists. What is the most likely culprit?

2. For WHERE active = true where 90% of rows are active, why might the optimizer pick a full scan over the index?