Write amplification & compaction

One write from your app becomes many writes on disk — the storage engine keeps re-copying the same data downhill.

The idea

Databases like RocksDB, LevelDB, and Cassandra use an LSM-tree (log-structured merge tree). To make writes fast, they never overwrite data in place. A new write lands in memory and an append-only log, then gets flushed to an immutable sorted file (an SSTable) at the top level.

The catch: those files pile up, so a background process called compaction keeps merging and rewriting them down through the levels — L0 into L1, L1 into L2, and so on. Every merge rewrites bytes that were already on disk. By the time a single 1 KB user write settles into the deepest level, the disk may have physically written it ten, twenty, even thirty times. That ratio — physical bytes written ÷ logical bytes written — is the write amplification factor (WAF).

Press play, or step through, to watch one logical write turn into many physical writes.

How it works

Writes take a fast path into memory; the slow, expensive part is compaction running in the background. Each compaction rewrites data that is already durable on disk, which is exactly where the amplification comes from.

# LSM-tree write path (simplified)

def put(key, value):
    wal.append(key, value)        # 1 durable write (write-ahead log)
    memtable[key] = value         # in-memory only, no SSTable write yet
    if memtable.full():
        flush_to_L0(memtable)     # write a brand-new SSTable at level 0

def compact(level):
    # merge overlapping SSTables from `level` into `level + 1`
    inputs = tables_at[level] + overlapping(tables_at[level + 1])
    merged = merge_sorted(inputs)        # keep newest value per key, drop tombstones
    write_sstables(level + 1, merged)    # REWRITES data already on disk
    delete(inputs)                       # old files removed once the new ones are durable

# write amplification factor (WAF)
#   = total bytes written to disk  /  bytes the application wrote
#   = (WAL + flush + every compaction pass) / logical bytes

Cost

Dimension	What it measures	Typical for leveled LSM
Write amplification	Disk bytes written per logical byte	~10–30×
Read amplification	SSTables checked per point lookup	~1 per level (bloom filters help)
Space amplification	Disk used per byte of live data	~1.1× (leveled keeps it tight)
Write latency	Cost the user pays per write	Low — sequential append, no random I/O

The core trade-off: an LSM-tree accepts high write amplification in exchange for fast sequential writes and strong compression. A B-tree does the opposite — lower write amplification, but it pays with random in-place I/O on every update.

Watch out for

Leveled vs tiered compaction. Leveled compaction keeps space amplification low (~1.1×) but rewrites more often, raising write amplification. Tiered (size-tiered) compaction flips it: less rewriting, lower write-amp, but data sits duplicated across files, so space-amp balloons. Pick the one that matches your bottleneck.
Write stalls. If ingest outruns compaction, L0 files stack up and the engine throttles or pauses writes until compaction catches up. A read-heavy benchmark can hide this; a sustained write burst exposes it.
SSD wear. A WAF of 20× means the flash endurance budget drains 20× faster than the logical write rate suggests. Capacity-plan for physical writes, not logical ones.
Latency spikes from big compactions. A large L0→L1 or deep-level merge moves a lot of data at once and competes with foreground reads for disk and CPU, showing up as tail-latency bumps.
Measuring WAF wrong. True amplification counts the WAL plus the initial flush plus every compaction pass. Counting only one stage badly understates it.

Worked example

Follow one 1 KB key the application writes exactly once, and count the physical writes it triggers as it sinks through the tree:

Stage	Bytes written to disk	Running total
Append to WAL	1 KB	1 KB
Flush memtable → L0	1 KB	2 KB
Compact L0 → L1	1 KB	3 KB
Compact L1 → L2	1 KB	4 KB
Compact L2 → L3	1 KB	5 KB

The user wrote 1 KB; the disk wrote ~5 KB, so WAF ≈ 5× in this stripped-down trace. Real engines are worse: each compaction also re-reads and rewrites the overlapping data already living at the target level (the leveled fan-out, often ~10× per level), and keys get caught in several passes. That fan-out is why production write amplification commonly lands in the 10–30× range.

Check yourself

An LSM-tree engine reports a write amplification factor of 22×. Your application writes 50 GB of data over a day. Roughly how much does the disk physically write?

Your service has a steady, heavy write load and your storage is filling up faster than expected. You care most about keeping space amplification low. Which compaction strategy fits?