Secondary index maintenance

Every write to the table is really two writes — the row, and the index entry that points at it.

The idea

A base table is keyed by its primary key. To find rows by some other column — say email — without scanning every row, you build a secondary index: a separate structure sorted by that column, where each entry maps email → pk.

The catch is that the index is a copy of part of the data, so it can drift out of sync. Every base-table write has to maintain it: an insert adds an entry, a delete removes one, and an update of the indexed column must remove the old entry and insert a new one — because the entry's sorted position changes. Get that delete-old step wrong and the index points at stale data.

See it work

Press play to watch writes propagate from the table into the index.

How it works

The index is sorted by (indexed_column, pk) so lookups are a binary search. Every write touches both structures inside one transaction, so the index can never diverge from the table. The update case is the subtle one: the old entry can't be edited in place, because changing the value moves where it sorts — you must delete the old key and insert the new one.

def insert(pk, email, row):
    table[pk] = row
    index.insert((email, pk))         # new entry at sorted position

def update(pk, email_old, email_new, row):
    table[pk] = row
    if email_old != email_new:
        index.delete((email_old, pk)) # remove stale entry
        index.insert((email_new, pk)) # add new entry
    # both under one transaction so the index never diverges

def delete(pk, email):
    del table[pk]
    index.delete((email, pk))         # drop the entry too

def find_by_email(email):
    pk = index.seek(email)            # O(log n) into the index
    return table[pk]                  # then fetch the row by pk

Cost

Operation	Work	Why
Find by email	O(log n) seek + 1 fetch	Binary-search the index, then read the row by pk — vs O(n) full scan with no index
Insert / delete row	1 row write + 1 index write	The entry is added or removed alongside the row
Update indexed column	1 row write + 2 index writes	Delete old entry, insert new one — write amplification
Storage	+1 sorted copy per index	The indexed column and pk are duplicated; more indexes means more disk

Watch out for

Forgetting to delete the old entry on update. If you only insert the new key, the stale entry survives and find_by_email(old) still returns the row — an orphaned, wrong result. Always pair delete-old with insert-new.
Non-atomic table and index writes. If the row commits but the index write is lost to a crash, the two diverge. Wrap both in one transaction; on recovery, rely on read-repair or an index rebuild.
Over-indexing. Every extra index multiplies write amplification — one row update can fan out to many index ops. Index the columns you actually query, not every column.
Index bloat. Churn from updates and deletes leaves tombstones and dead space. Indexes need periodic compaction or a rebuild to stay tight and fast.
Non-covering indexes. If the index only stores email → pk, the read still has to fetch the row by pk. A covering index that stores the needed columns avoids that second hop — at the cost of more storage.

Worked example

The table holds users keyed by pk, indexed by email. Row pk=7 has email = "a@x", so the index contains the entry ("a@x", 7). The user changes their address: update(7, "a@x", "b@x", row). The table row is rewritten, then the index does two ops — delete(("a@x", 7)) removes the stale entry, and insert(("b@x", 7)) places the new one at its sorted position. Now find_by_email("b@x") seeks to pk 7 and fetches the row. Had you skipped the delete, find_by_email("a@x") would still seek to pk 7 and return the row — a stale hit pointing at someone who no longer has that email.

Check yourself

You update the indexed column of a row from "a@x" to "b@x". What must happen to the index?

Coach note: the index is sorted by the value, so a new value belongs at a different position — you can't edit in place, and leaving the old entry behind makes lookups return stale rows.