Secondary indexes

A table is sorted by its primary key. To find rows by any other column quickly, you keep a second little sorted book that points back to them.

The idea

A table is physically organized by its primary key — rows are kept in id order, so finding a row by id is a quick jump. But a query like "all users in the eng department" filters on a non-primary column. With nothing to help, the engine reads every row and checks its dept: a full scan, O(n).

A secondary index is an auxiliary sorted structure that maps a column value to the primary keys that have it: dept → [ids]. Now the engine looks up the value, gets the matching ids straight away, and fetches just those rows. The index is maintained on every write that touches the indexed column, and consulted on every read that filters by it.

base table · ordered by id id name dept 1anaeng 2bosales 3cyeng 4di ? ? ? ? index on dept → ids dept ids eng[1, 3] sales[2] covering: index also stores name → no base fetch
Rows are stored in id order — finding by id is a jump, finding by dept is not.

How it works

The index is just a sorted map kept beside the table. Two operations matter: maintain it whenever an indexed column is written, and look it up when a read filters on that column. A covering index also stores the columns the query returns, so the read never touches the base table; a non-covering index gives back only ids and must fetch each row.

# maintain on every write that touches the indexed column
def write(id, old_dept, new_dept):
    if old_dept is not None:
        index[old_dept].remove(id)   # leave the stale bucket
    index[new_dept].add(id)          # join the new bucket, kept sorted

# read: filter by dept
def find_by_dept(dept):
    ids = index.get(dept, [])        # one lookup, not a full scan
    if COVERING:                     # index already holds the wanted columns
        return [(i, index_cols[i]) for i in ids]
    else:                            # non-covering: one base fetch per id
        return [base_table[i] for i in ids]   # N back-trips

The lookup cost is roughly the depth of the index plus the number of matches. The hidden cost lives in the non-covering branch: one base-table fetch per matching id — read amplification that a covering index removes by carrying the answer itself.

Signals

ChoiceBuys youCosts you
Add a secondary indexFilter reads on that column go from O(n) to a lookupEvery write to the column updates the index too; extra storage
Covering indexRead answered from the index alone, no base fetchesMore storage, more to maintain per write
Non-covering indexSmaller, cheaper to keep currentOne base-table fetch per matching row — read amplification
High-cardinality columnEach value selects very few rows, big speedupLow-cardinality columns barely beat a full scan

Watch out for

Worked example

Query: SELECT name FROM users WHERE dept = 'eng'. With a secondary index on dept, the engine looks up the eng bucket and gets [1, 3] at once — rows 2 and 4 are never read. If the index is non-covering, it then fetches base rows 1 and 3 to read their name: two extra back-trips, one per match. If the index is covering — it also stores name — the bucket already holds (1, ana) and (3, cy), so the answer returns with zero base fetches. Same lookup; covering trades storage to skip the read amplification.

Check yourself

A non-covering index on dept finds 500 matching ids for a query. What extra work happens before the rows are returned?

Coach note: a non-covering index returns only primary keys, so every match needs a trip back to the base table — that's the read amplification a covering index removes. Take another pass if covering vs non-covering still blurs together.