One write from your app becomes many writes on disk — the storage engine keeps re-copying the same data downhill.
Databases like RocksDB, LevelDB, and Cassandra use an LSM-tree (log-structured merge tree). To make writes fast, they never overwrite data in place. A new write lands in memory and an append-only log, then gets flushed to an immutable sorted file (an SSTable) at the top level.
The catch: those files pile up, so a background process called compaction keeps merging and rewriting them down through the levels — L0 into L1, L1 into L2, and so on. Every merge rewrites bytes that were already on disk. By the time a single 1 KB user write settles into the deepest level, the disk may have physically written it ten, twenty, even thirty times. That ratio — physical bytes written ÷ logical bytes written — is the write amplification factor (WAF).
Writes take a fast path into memory; the slow, expensive part is compaction running in the background. Each compaction rewrites data that is already durable on disk, which is exactly where the amplification comes from.
# LSM-tree write path (simplified)
def put(key, value):
wal.append(key, value) # 1 durable write (write-ahead log)
memtable[key] = value # in-memory only, no SSTable write yet
if memtable.full():
flush_to_L0(memtable) # write a brand-new SSTable at level 0
def compact(level):
# merge overlapping SSTables from `level` into `level + 1`
inputs = tables_at[level] + overlapping(tables_at[level + 1])
merged = merge_sorted(inputs) # keep newest value per key, drop tombstones
write_sstables(level + 1, merged) # REWRITES data already on disk
delete(inputs) # old files removed once the new ones are durable
# write amplification factor (WAF)
# = total bytes written to disk / bytes the application wrote
# = (WAL + flush + every compaction pass) / logical bytes
| Dimension | What it measures | Typical for leveled LSM |
|---|---|---|
| Write amplification | Disk bytes written per logical byte | ~10–30× |
| Read amplification | SSTables checked per point lookup | ~1 per level (bloom filters help) |
| Space amplification | Disk used per byte of live data | ~1.1× (leveled keeps it tight) |
| Write latency | Cost the user pays per write | Low — sequential append, no random I/O |
The core trade-off: an LSM-tree accepts high write amplification in exchange for fast sequential writes and strong compression. A B-tree does the opposite — lower write amplification, but it pays with random in-place I/O on every update.
Follow one 1 KB key the application writes exactly once, and count the physical writes it triggers as it sinks through the tree:
| Stage | Bytes written to disk | Running total |
|---|---|---|
| Append to WAL | 1 KB | 1 KB |
| Flush memtable → L0 | 1 KB | 2 KB |
| Compact L0 → L1 | 1 KB | 3 KB |
| Compact L1 → L2 | 1 KB | 4 KB |
| Compact L2 → L3 | 1 KB | 5 KB |
The user wrote 1 KB; the disk wrote ~5 KB, so WAF ≈ 5× in this stripped-down trace. Real engines are worse: each compaction also re-reads and rewrites the overlapping data already living at the target level (the leveled fan-out, often ~10× per level), and keys get caught in several passes. That fan-out is why production write amplification commonly lands in the 10–30× range.
An LSM-tree engine reports a write amplification factor of 22×. Your application writes 50 GB of data over a day. Roughly how much does the disk physically write?
Your service has a steady, heavy write load and your storage is filling up faster than expected. You care most about keeping space amplification low. Which compaction strategy fits?