WAL, tombstones, and read amplification

Writes are cheap appends; the cost is paid later, when one read has to dig through many layers — until compaction tidies up.

The idea

A log-structured store makes writes cheap. Every write is first appended to a write-ahead log (WAL) on disk and fsync'd for durability, then applied to an in-memory memtable. When the memtable fills, it flushes to an immutable on-disk SSTable.

Deletes don't erase anything — they append a tombstone, a marker that shadows older values living in lower layers. Over time a single key's versions spread across many SSTables, so a read has to scan newest to oldest until it finds a value or a tombstone. That extra work per logical read is read amplification. Compaction merges SSTables, keeps only the newest version per key, drops tombstones once nothing older can hide beneath them, and so reclaims space and shrinks the read cost.

Press play to follow writes, a delete, two reads, and a compaction.

How it works

The whole store is append-only on the write path. put and delete never modify existing data; they add a newer record that shadows the old one. get walks from the freshest place (the memtable) toward the oldest layer and stops at the very first record it finds for the key — even if that record is a tombstone, which means "deleted".

def put(key, value):
    wal.append((key, value)); wal.fsync()   # durable first
    memtable[key] = value                    # then in memory

def delete(key):
    wal.append((key, TOMBSTONE)); wal.fsync()
    memtable[key] = TOMBSTONE                 # a marker, not an erase

def get(key):
    # newest source first: memtable, then SSTables newest -> oldest
    for src in [memtable] + sstables_newest_first():
        if key in src:                        # one disk read per SSTable
            rec = src[key]
            return None if rec is TOMBSTONE else rec
    return None                               # never written

def compact(runs):
    # merge several runs, keep only the newest version per key
    out = {}
    for src in runs_newest_first(runs):
        for key, rec in src:
            if key not in out:                # first seen == newest
                out[key] = rec
    # drop tombstones once no older data can hide beneath this merge
    return {k: r for k, r in out.items() if r is not TOMBSTONE}

A bloom filter per SSTable lets get skip a layer that provably can't contain the key, cutting some disk reads — but it can't help once the key really does live low down.

Cost

Operation	Cost
Write (put / delete)	O(1) amortized — one WAL append + fsync, one memtable insert
Read (get)	O(layers) worst case — memtable plus every SSTable until a hit; bloom filters trim, don't remove
Space	Old versions and tombstones linger until compaction reclaims them
Trade-off	Fast, durable, sequential writes now — read and space amplification paid back later, by I/O-heavy compaction

Watch out for

Tombstones that never get compacted away. A delete-heavy or range-delete workload accumulates tombstone bloat; range scans must still read past every tombstone, so a "mostly empty" range can be slow until compaction physically removes them.
fsync per write kills throughput. A durable fsync on every write trades latency for safety. Group commit / batched fsync amortizes it across many writes — at the cost of a small window where un-synced writes could be lost.
Read amplification climbs with deletes and overwrites. The more versions a key has spread across layers, the more layers a read may touch. Heavy churn makes reads progressively slower between compactions.
Bloom filters reduce but never remove amplification. They skip layers that can't hold the key, but carry a false-positive rate, and give no help once the key genuinely lives in a deep layer — that read still walks down to it.
Compaction is I/O-heavy and can stall foreground traffic. Merging large SSTables competes for disk and CPU; if it can't keep up with the write rate, the engine throttles writes (write stalls).

Worked example

Picture a queue table backed by an LSM store: rows are inserted at the tail and deleted from the head as they're processed. Each delete is a tombstone, not an erasure. After millions of items pass through, the "head" of the queue is buried under millions of tombstones for already-processed rows. A scan for the next live item — SELECT … LIMIT 1 at the head — must read past every one of those tombstones before reaching real data, and latency falls off a cliff.

Triggering compaction merges the runs, drops the obsolete rows and their now-unneeded tombstones, and collapses the head back to a thin layer. The same scan that crawled is fast again. This is exactly why delete-as-a-queue patterns are a known footgun on log-structured engines.

Check yourself

A DELETE just ran on a key. Is the row's data gone from disk?