Write amplification & compaction

One write from your app becomes many writes on disk — the storage engine keeps re-copying the same data downhill.

The idea

Databases like RocksDB, LevelDB, and Cassandra use an LSM-tree (log-structured merge tree). To make writes fast, they never overwrite data in place. A new write lands in memory and an append-only log, then gets flushed to an immutable sorted file (an SSTable) at the top level.

The catch: those files pile up, so a background process called compaction keeps merging and rewriting them down through the levels — L0 into L1, L1 into L2, and so on. Every merge rewrites bytes that were already on disk. By the time a single 1 KB user write settles into the deepest level, the disk may have physically written it ten, twenty, even thirty times. That ratio — physical bytes written ÷ logical bytes written — is the write amplification factor (WAF).

Press play, or step through, to watch one logical write turn into many physical writes.

How it works

Writes take a fast path into memory; the slow, expensive part is compaction running in the background. Each compaction rewrites data that is already durable on disk, which is exactly where the amplification comes from.

# LSM-tree write path (simplified)

def put(key, value):
    wal.append(key, value)        # 1 durable write (write-ahead log)
    memtable[key] = value         # in-memory only, no SSTable write yet
    if memtable.full():
        flush_to_L0(memtable)     # write a brand-new SSTable at level 0

def compact(level):
    # merge overlapping SSTables from `level` into `level + 1`
    inputs = tables_at[level] + overlapping(tables_at[level + 1])
    merged = merge_sorted(inputs)        # keep newest value per key, drop tombstones
    write_sstables(level + 1, merged)    # REWRITES data already on disk
    delete(inputs)                       # old files removed once the new ones are durable

# write amplification factor (WAF)
#   = total bytes written to disk  /  bytes the application wrote
#   = (WAL + flush + every compaction pass) / logical bytes

Cost

DimensionWhat it measuresTypical for leveled LSM
Write amplificationDisk bytes written per logical byte~10–30×
Read amplificationSSTables checked per point lookup~1 per level (bloom filters help)
Space amplificationDisk used per byte of live data~1.1× (leveled keeps it tight)
Write latencyCost the user pays per writeLow — sequential append, no random I/O

The core trade-off: an LSM-tree accepts high write amplification in exchange for fast sequential writes and strong compression. A B-tree does the opposite — lower write amplification, but it pays with random in-place I/O on every update.

Watch out for

Worked example

Follow one 1 KB key the application writes exactly once, and count the physical writes it triggers as it sinks through the tree:

StageBytes written to diskRunning total
Append to WAL1 KB1 KB
Flush memtable → L01 KB2 KB
Compact L0 → L11 KB3 KB
Compact L1 → L21 KB4 KB
Compact L2 → L31 KB5 KB

The user wrote 1 KB; the disk wrote ~5 KB, so WAF ≈ 5× in this stripped-down trace. Real engines are worse: each compaction also re-reads and rewrites the overlapping data already living at the target level (the leveled fan-out, often ~10× per level), and keys get caught in several passes. That fan-out is why production write amplification commonly lands in the 10–30× range.

Check yourself

An LSM-tree engine reports a write amplification factor of 22×. Your application writes 50 GB of data over a day. Roughly how much does the disk physically write?

Your service has a steady, heavy write load and your storage is filling up faster than expected. You care most about keeping space amplification low. Which compaction strategy fits?