How a cost-based optimizer picks a query plan

The database estimates what each possible plan would cost, then runs the cheapest one.

The idea

Imagine you need one specific name in a phone book. You could read every entry from the front (a full table scan), or you could jump straight to the right page using the alphabetical tabs (an index seek). For a single name, the tabs win easily. But if you wanted everyone in the book, flipping back and forth between tabs and pages would be slower than just reading straight through.

A cost-based optimizer makes exactly this call for every query. The same SQL can run many ways — different access paths, join orders, join algorithms. Using table statistics (row counts, how selective a predicate is, which indexes exist), it estimates the cost — in rows touched and I/O — of each candidate plan, and picks the lowest one. It never runs them to find out; it estimates from stats and commits to one.

Pick a predicate and select Optimize to compare the two plans.

How it works

The optimizer enumerates candidate plans, costs each from statistics, and keeps the cheapest. A simplified cost model charges roughly one unit per row a plan must touch — a full scan touches every row, while an index seek touches only the rows the predicate is estimated to match (plus a few cheap index-tree hops). The estimate comes straight from the histogram and row count, never from running the query.

-- Same query, two candidate access paths the planner considers:
SELECT * FROM users WHERE id = 42;

-- Planner estimates each from table statistics:
--   total_rows           = 10000      (from pg_class / stats)
--   estimated_matches    = 1          (selectivity of id = 42)
--
-- cost(full scan)  = total_rows               = 10000 rows read
-- cost(index seek) = index_hops + matches     =    3 + 1 = 4 rows read
--
-- choose plan with min(cost)  ->  index seek wins (4 < 10000)

EXPLAIN SELECT * FROM users WHERE id = 42;
--  Index Scan using users_pkey on users  (cost=0.29..8.30 rows=1)
--  -- planner chose the seek; "rows=1" is the selectivity estimate

EXPLAIN SELECT * FROM users WHERE active = true;
--  Seq Scan on users  (cost=0.00..170.00 rows=9000)
--  -- low selectivity: a seek would touch almost everything,
--  --  so the full scan is now the cheaper plan

Trade-offs

Access pathWhen it winsRough cost model
Index seek Highly selective predicate (few matching rows), e.g. a primary-key or unique lookup tree hops + matched rows; cheap when matches are few
Full table scan Low selectivity (most rows match) or no usable index; reading sequentially beats random jumps all rows, read sequentially (cheap per row)
Index scan (range) A range or sorted read where index order matches; moderate selectivity matched rows + per-row lookups (random I/O can dominate)

The crossover point is selectivity. As estimated matches climb toward the table size, the per-row random-access overhead of the seek overtakes the cheap sequential read of a scan, and the optimizer flips its choice.

Watch out for

Worked example

Take SELECT * FROM users WHERE id = 42 against a 12-row sample of a 10,000-row table. With the high-selectivity predicate, the optimizer reads its stats: 10,000 total rows, estimated 1 match. The full-scan plan would walk every row — its cost counter climbs row by row to 12 in the sample (10,000 in reality). The index-seek plan instead hops down the B-tree (a handful of cheap node visits) and lands directly on row 42, touching just that one matching row. Costs compared: seek ≈ 4, scan = 10,000. The optimizer rings the seek as the winner and runs it.

Now switch the predicate to low selectivity (active = true, matching most rows). The seek would have to jump to nearly every row through the index one at a time — its estimated cost balloons past the scan, which reads the same rows but sequentially and cheaply. Same query shape, same table, but the cheaper plan is now the full scan. That flip is the whole point: the optimizer re-decides from the numbers, it does not have a favourite.

Check yourself

1. Your query WHERE id = 42 suddenly runs a slow full scan even though an index on id exists. What is the most likely culprit?

2. For WHERE active = true where 90% of rows are active, why might the optimizer pick a full scan over the index?