SQL index design

An index is the book's table of contents — it lets the database jump straight to a row instead of reading every page.

The idea

Imagine a phone book where the names are in no particular order. To find one person, you'd have to read every entry from the first page to the last. That is what a database does when it runs a sequential scan: it examines every row in the table.

An index is a separate, sorted copy of one column (or a few columns) with a pointer back to each row. Because it is sorted, the database can do an index seek — descend a tree to land near the answer in a handful of steps. The cost drops from looking at n rows to about log n.

See it work

Query: SELECT * FROM users WHERE id = 42

No index yet. Press play to watch a full table scan.

How it works

You declare an index on the column(s) you filter or join on. The database builds and maintains a sorted structure — almost always a B-tree — whose leaves hold the indexed values together with a pointer (row id) to the full row.

-- Build a B-tree index on the column used in the WHERE clause
CREATE INDEX idx_users_id ON users (id);

-- This query can now seek instead of scan
SELECT * FROM users WHERE id = 42;

To resolve id = 42, the engine descends the tree instead of reading every row:

# Descend the B-tree from the root (pseudocode)
node = root
while node is not a leaf:
    # each internal node holds sorted separator keys + child pointers
    i = first child whose key-range can contain 42
    node = node.children[i]

# at the leaf: binary-search the sorted keys
if 42 in node.keys:
    row_ptr = node.pointer_for(42)   # jump straight to the row
    return fetch(row_ptr)
return NOT FOUND
# depth of the tree is ~log(n), so only a few nodes are touched

The payoff depends on selectivity — what fraction of rows match. A highly selective predicate (a unique id or email) returns very few rows, so the seek is a big win. A low-selectivity predicate (like is_active = true matching half the table) returns so many rows that scanning is often cheaper, and the planner may ignore the index.

Cost / trade-offs

Dimension	Full scan	Index seek
Read cost (find a row)	O(n) — reads every row	O(log n) — descends the tree
Best for	Returning most of the table	Highly selective lookups
Write cost	None extra	Slows `INSERT` / `UPDATE` / `DELETE` — the index must be kept sorted
Storage	None extra	Extra disk for the sorted copy + pointers
Tiny tables	Fine — a scan is already fast	Rarely worth the overhead

An index is a read-time accelerator paid for at write time and in storage. Add one when a column is searched often and is selective; skip it when the table is small, the column repeats heavily, or the table is write-dominated.

Watch out for

Functions on the indexed column disable the seek. WHERE lower(email) = 'a@b.com' can't use a plain index on email — the predicate is non-sargable. Index the expression itself (CREATE INDEX ON users (lower(email))) or store the normalized value.
Leading-column rule for composite indexes. An index on (last_name, first_name) helps WHERE last_name = ? but not WHERE first_name = ? alone — you can't skip the leading column.
Low-selectivity columns waste an index. A column with a handful of repeated values (a status flag, a boolean) returns too many rows; the planner will usually scan anyway.
Over-indexing write-heavy tables. Every index must be updated on each write. Many indexes on a hot table can quietly throttle inserts and bloat storage.
Leading wildcards can't seek. LIKE 'smith%' can use a B-tree (it's a prefix range), but LIKE '%smith' cannot — there is no sorted prefix to descend to.

Worked example

A users table has 5 million rows and you run the login lookup:

SELECT id, password_hash FROM users WHERE email = 'ada@calc.io';

With no index, the planner has no sorted path to email, so it does a sequential scan — up to 5,000,000 row reads on every login. Because email is unique, the predicate is maximally selective, which is exactly the case an index serves best:

CREATE UNIQUE INDEX idx_users_email ON users (email);

Now the same query descends a B-tree of depth ~3, binary-searches the matching leaf, and follows one pointer to the row — roughly log&sub2;(5,000,000) ≈ 23 comparisons instead of millions. The unique constraint also guarantees at most one match, so the engine can stop the moment it lands. The write cost is real but modest: each sign-up now maintains one extra sorted structure.

Check yourself

1. You have an index on users (email). Which query can use it for a fast seek?

2. Which column is the worst candidate for its own index?