Compaction stalls

An LSM-tree storage engine throttles your writes when the background merge falls behind the incoming flood.

The idea

Log-structured merge (LSM) engines like RocksDB, Cassandra, and LevelDB take writes fast by never updating data in place. A write lands in an in-memory memtable; when it fills, the engine flushes it as a brand-new immutable file — an SSTable — into the top level, L0.

Those files pile up. A background process called compaction merges them downward into bigger, sorted levels, dropping overwritten keys and obsolete tombstones along the way. But compaction throughput is finite, and a heavy write burst can flush new files faster than compaction merges them away. When L0 files accumulate past a threshold, the engine protects itself by applying backpressure: first it slows incoming writes, then it stalls them entirely until compaction drains the backlog.

See it work

[Interactive animation: a write burst pushes L0 toward the stall threshold.]

How it works

The write path is append-only and tiered:

1. WAL + memtable. Every write is appended to a write-ahead log (for crash recovery) and inserted into a sorted in-memory memtable.
2. Flush. When the memtable fills, it becomes immutable and is flushed to disk as a new L0 SSTable: a sorted, self-contained file.
3. Compaction. A background job merges overlapping SSTables into the next level down (L0→L1→L2), producing fewer, larger, non-overlapping files and physically dropping overwritten keys, plus deletion tombstones once no older version of the key remains in a deeper level.
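The three steps above can be sketched as a toy model. This is a minimal illustration, not how any real engine is implemented: the class name `ToyLSM`, the `memtable_limit` parameter, and the use of in-memory lists in place of on-disk files are all assumptions made for the example, and the memtable is a plain dict that gets sorted only at flush time.

```python
import json


class ToyLSM:
    """Toy model of the write path: WAL append, memtable insert, flush, compact."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit  # hypothetical: keys held before a flush
        self.wal = []        # stands in for the on-disk write-ahead log
        self.memtable = {}   # in-memory buffer, sorted when flushed
        self.l0 = []         # immutable SSTables, newest first
        self.l1 = []         # one sorted run of (key, value) pairs

    def put(self, key, value):
        self.wal.append(json.dumps([key, value]))  # step 1: log first, for recovery
        self.memtable[key] = value                 # step 1: then update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Step 2: freeze the memtable into a sorted, immutable SSTable in L0.
        sstable = tuple(sorted(self.memtable.items()))
        self.l0.insert(0, sstable)
        self.memtable = {}
        self.wal.clear()  # flushed data no longer needs the log

    def compact(self):
        # Step 3: merge every L0 file into L1; newer values overwrite older ones.
        merged = dict(self.l1)
        for sstable in reversed(self.l0):  # oldest first, so the newest write wins
            merged.update(sstable)
        self.l1 = sorted(merged.items())
        self.l0 = []


db = ToyLSM(memtable_limit=2)
for k, v in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    db.put(k, v)
l0_files_before = len(db.l0)   # two flushes have produced two overlapping L0 files
db.compact()
print(l0_files_before, db.l1)  # 2 [('a', 3), ('b', 2), ('c', 4)]
```

Note how the overwritten value `("a", 1)` survives in L0 until compaction runs; that leftover data is what the amplification factors later in this section measure.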

The catch is read amplification: a point lookup may have to check every L0 file (they have overlapping key ranges) plus one file per deeper level. Bloom filters skip files that can't contain the key, but more files still means slower reads. Compaction keeps file count and read amplification bounded — so when it falls behind, the engine throttles writes rather than let reads rot.
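To make the arithmetic concrete, here is a rough sketch of how the number of files a point lookup may touch grows with L0. The function names and the single `bloom_fp_rate` parameter are hypothetical, and the model assumes leveled compaction (one candidate file per deeper level) and a uniform bloom-filter false-positive rate.

```python
def worst_case_probes(l0_files: int, deeper_levels: int) -> int:
    """SSTables a point lookup may consult with no bloom-filter help."""
    # Every L0 file can overlap the key; each deeper level holds
    # non-overlapping files, so at most one file per level can match.
    return l0_files + deeper_levels


def expected_reads_for_missing_key(l0_files: int, deeper_levels: int,
                                   bloom_fp_rate: float) -> float:
    """Expected SSTables actually read for a key that is in none of them."""
    # Each candidate file is read only when its bloom filter gives a
    # false positive, which happens with probability bloom_fp_rate.
    return worst_case_probes(l0_files, deeper_levels) * bloom_fp_rate


healthy = worst_case_probes(l0_files=4, deeper_levels=3)    # 7 files
bloated = worst_case_probes(l0_files=12, deeper_levels=3)   # 15 files
print(healthy, bloated)
```

Tripling L0 from 4 to 12 files roughly doubles the worst-case probe count here, and the expected cost of bloom-filter false positives scales by the same factor.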

RocksDB gates writes on the L0 file count with two triggers (level0_slowdown_writes_trigger and level0_stop_writes_trigger) and on the estimated compaction backlog with two byte limits (soft_pending_compaction_bytes_limit and hard_pending_compaction_bytes_limit):

# Called on every flush, before admitting more writes.
# Defaults shown are RocksDB-style; tune per workload.
slowdown_trigger = 20         # level0_slowdown_writes_trigger
stop_trigger     = 36         # level0_stop_writes_trigger
soft_limit       = 64 << 30   # soft_pending_compaction_bytes_limit (64 GiB)
hard_limit       = 256 << 30  # hard_pending_compaction_bytes_limit (256 GiB)

def admit_write(db):
    # db is an illustrative handle exposing the two counters used below.
    n = db.num_l0_files()
    pending = db.pending_compaction_bytes()
    if n >= stop_trigger or pending >= hard_limit:
        return "stop"         # hard stall: foreground writes wait
    if n >= slowdown_trigger or pending >= soft_limit:
        return "slowdown"     # soft slowdown: delay each write a little
    return "admit"            # healthy: write flows at full speed

(The visual above uses small thresholds — slowdown at L0 ≥ 8, stop at L0 ≥ 12 — so the pile-up and drain are easy to see in a few steps.)

Cost / signals

| Factor | What it measures | How to read it |
| --- | --- | --- |
| Write amplification | Bytes written to disk ÷ bytes the app wrote. Data is rewritten each time it moves down a level. | Leveled compaction can be 10–30× on write-heavy loads. High WA burns IO and SSD endurance. |
| Read amplification | SSTables consulted per read: all of L0 plus one per deeper level (minus bloom-filter skips). | Climbs with L0 file count. A bloated L0 means both slow reads and an imminent stall. |
| Space amplification | On-disk bytes ÷ live bytes. Overwrites and tombstones not yet merged away. | Grows when compaction lags or tombstones linger ("compaction debt"). Reclaimed only by merging. |
| Stall signal | p99 write latency, L0 file count, and pending_compaction_bytes over time. | p99 spiking + L0 climbing toward the stop trigger = compaction can't keep up. Watch pending bytes as the leading indicator. |
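The two ratio-based factors are simple divisions once you have the byte counters. The sketch below assumes a hypothetical `IoCounters` record that you fill from whatever statistics your engine exposes; the field names and the sample numbers are illustrative, not a real API or measured data.

```python
from dataclasses import dataclass


@dataclass
class IoCounters:
    """Hypothetical byte counters gathered from engine statistics."""
    user_bytes_written: int   # bytes the application asked to write
    disk_bytes_written: int   # bytes written by flushes and compactions
    on_disk_bytes: int        # total size of all SSTables right now
    live_bytes: int           # size of the newest version of each live key


def write_amplification(c: IoCounters) -> float:
    # Bytes written to disk divided by bytes the app wrote.
    return c.disk_bytes_written / c.user_bytes_written


def space_amplification(c: IoCounters) -> float:
    # On-disk bytes divided by live bytes.
    return c.on_disk_bytes / c.live_bytes


# Made-up numbers in GiB, purely to show the arithmetic.
sample = IoCounters(user_bytes_written=10, disk_bytes_written=150,
                    on_disk_bytes=13, live_bytes=10)
print(write_amplification(sample))  # 15.0
print(space_amplification(sample))  # 1.3
```

Tracking both ratios over time, rather than as one-off readings, is what makes them useful next to the stall signals in the last row.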

Watch out for

Worked example

Start steady: L0 = 4 files, writes flowing. A burst arrives and flushes outrun compaction.

L0=4   flowing     write burst begins, flushes land in L0
L0=8   slowed      crossed slowdown trigger (8) -> each write delayed
L0=12  STALLED     crossed stop trigger (12) -> new writes blocked
   ...  compaction merges L0 files down into L1...
L0=3   flowing     backlog drained below triggers -> writes resume

As L0 climbs from 4 to 8, the engine starts adding a small delay to each write (soft slowdown) — p99 write latency ticks up. At 12 it crosses the hard stop_trigger and blocks new writes entirely; the foreground sees a write stall. Background compaction keeps merging L0 SSTables into L1, the file count falls back to 3, and once it's safely below both triggers the engine resumes admitting writes at full speed.
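The same pile-up and drain can be reproduced with a few lines of simulation. This is a sketch under stated assumptions: it uses the demo thresholds from this example (8 and 12), and the per-step flush and compaction rates are hypothetical values chosen only so the burst outruns compaction.

```python
SLOWDOWN_TRIGGER = 8   # demo threshold from the worked example
STOP_TRIGGER = 12      # demo threshold from the worked example


def write_state(l0_files):
    """Classify the write path from the L0 file count alone."""
    if l0_files >= STOP_TRIGGER:
        return "STALLED"
    if l0_files >= SLOWDOWN_TRIGGER:
        return "slowed"
    return "flowing"


# Hypothetical rates: L0 files flushed per step in each state during a burst.
FLUSHES_PER_STEP = {"flowing": 3, "slowed": 2, "STALLED": 0}
COMPACTED_PER_STEP = 1  # hypothetical: L0 files merged into L1 per step


def simulate(l0_files, burst_steps, total_steps):
    """Return [(l0_files, state), ...]: the start plus one entry per step."""
    trace = [(l0_files, write_state(l0_files))]
    for step in range(total_steps):
        state = write_state(l0_files)
        if step < burst_steps:
            l0_files += FLUSHES_PER_STEP[state]            # flushes land in L0
        l0_files = max(0, l0_files - COMPACTED_PER_STEP)   # compaction drains
        trace.append((l0_files, write_state(l0_files)))
    return trace


trace = simulate(l0_files=4, burst_steps=6, total_steps=14)
for files, state in trace:
    print(f"L0={files:<3} {state}")
```

With these rates the count climbs from 4 through 8 to the stop trigger at 12 while the burst lasts, then drains by one file per step until writes flow again. A real engine's slowdown is a continuous rate limit rather than a fixed flush count, so treat the shape of the curve, not the numbers, as the takeaway.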

Check yourself

1. Your p99 write latency just spiked and the L0 SSTable count is climbing toward the stop trigger. What's the most likely cause?

2. You switch from leveled to size-tiered (tiered) compaction to cut IO. What gets worse as a result?