SQL index design

An index is the book's table of contents — it lets the database jump straight to a row instead of reading every page.

The idea

Imagine a phone book where the names are in no particular order. To find one person, you'd have to read every entry from the first page to the last. That is what a database does when it runs a sequential scan: it examines every row in the table.

An index is a separate, sorted copy of one column (or a few columns) with a pointer back to each row. Because it is sorted, the database can do an index seek — descend a tree to land near the answer in a handful of steps. The cost drops from looking at n rows to about log n.

See it work

Query: SELECT * FROM users WHERE id = 42
No index yet. Press play to watch a full table scan.

How it works

You declare an index on the column(s) you filter or join on. The database builds and maintains a sorted structure — almost always a B-tree — whose leaves hold the indexed values together with a pointer (row id) to the full row.

-- Build a B-tree index on the column used in the WHERE clause
CREATE INDEX idx_users_id ON users (id);

-- This query can now seek instead of scan
SELECT * FROM users WHERE id = 42;

To resolve id = 42, the engine descends the tree instead of reading every row:

# Descend the B-tree from the root (pseudocode)
node = root
while node is not a leaf:
    # each internal node holds sorted separator keys + child pointers
    i = first child whose key-range can contain 42
    node = node.children[i]

# at the leaf: binary-search the sorted keys
if 42 in node.keys:
    row_ptr = node.pointer_for(42)   # jump straight to the row
    return fetch(row_ptr)
return NOT FOUND
# depth of the tree is ~log(n), so only a few nodes are touched

The payoff depends on selectivity — what fraction of rows match. A highly selective predicate (a unique id or email) returns very few rows, so the seek is a big win. A low-selectivity predicate (like is_active = true matching half the table) returns so many rows that scanning is often cheaper, and the planner may ignore the index.

Cost / trade-offs

DimensionFull scanIndex seek
Read cost (find a row)O(n) — reads every rowO(log n) — descends the tree
Best forReturning most of the tableHighly selective lookups
Write costNone extraSlows INSERT / UPDATE / DELETE — the index must be kept sorted
StorageNone extraExtra disk for the sorted copy + pointers
Tiny tablesFine — a scan is already fastRarely worth the overhead

An index is a read-time accelerator paid for at write time and in storage. Add one when a column is searched often and is selective; skip it when the table is small, the column repeats heavily, or the table is write-dominated.

Watch out for

Worked example

A users table has 5 million rows and you run the login lookup:

SELECT id, password_hash FROM users WHERE email = 'ada@calc.io';

With no index, the planner has no sorted path to email, so it does a sequential scan — up to 5,000,000 row reads on every login. Because email is unique, the predicate is maximally selective, which is exactly the case an index serves best:

CREATE UNIQUE INDEX idx_users_email ON users (email);

Now the same query descends a B-tree of depth ~3, binary-searches the matching leaf, and follows one pointer to the row — roughly log&sub2;(5,000,000) ≈ 23 comparisons instead of millions. The unique constraint also guarantees at most one match, so the engine can stop the moment it lands. The write cost is real but modest: each sign-up now maintains one extra sorted structure.

Check yourself

1. You have an index on users (email). Which query can use it for a fast seek?

2. Which column is the worst candidate for its own index?