Compaction stalls

An LSM-tree storage engine throttles your writes when the background merge falls behind the incoming flood.

The idea

Log-structured merge (LSM) engines like RocksDB, Cassandra, and LevelDB take writes fast by never updating data in place. A write lands in an in-memory memtable; when it fills, the engine flushes it as a brand-new immutable file — an SSTable — into the top level, L0.

Those files pile up. A background process called compaction merges them downward into bigger, sorted levels, dropping overwritten keys and tombstones along the way. But a heavy write burst can flush new files faster than compaction merges them. When the L0 file count passes a threshold, the engine protects itself by applying backpressure: first it slows incoming writes, then it stalls them entirely until compaction drains the backlog.

See it work

Press play to watch a write burst push L0 toward the stall threshold.

How it works

The write path is append-only and tiered:

1. WAL + memtable. Every write is appended to a write-ahead log (for crash recovery) and inserted into a sorted in-memory memtable.
2. Flush. When the memtable fills, it becomes immutable and is flushed to disk as a new L0 SSTable: a sorted, self-contained file.
3. Compaction. A background job merges overlapping SSTables into the next level down (L0→L1→L2), producing fewer, larger, non-overlapping files and physically dropping overwritten keys and deletion tombstones.
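The three steps above can be sketched as a toy engine. This is a minimal illustration, not how any real engine is implemented: the class name ToyLSM and the memtable_limit knob are made up for the example, the memtable is a plain dict that is sorted only at flush time, the "WAL" is an in-memory list, and tombstones are left out.

```python
class ToyLSM:
    """Toy write path: WAL append, memtable insert, flush to L0, compact to L1."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # entries per memtable (hypothetical knob)
        self.wal = []        # stand-in for the on-disk write-ahead log
        self.memtable = {}   # mutable in-memory buffer
        self.l0 = []         # immutable sorted runs, oldest first
        self.l1 = []         # a single sorted run with unique keys

    def put(self, key, value):
        self.wal.append((key, value))   # 1. log first, for crash recovery
        self.memtable[key] = value      #    then update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # 2. Freeze the memtable into a sorted immutable run and add it to L0.
        self.l0.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.wal.clear()                # simplification: flushed data no longer needs the log

    def compact(self):
        # 3. Merge L1 with every L0 run, oldest first, so newer values win.
        merged = dict(self.l1)
        for run in self.l0:
            merged.update(run)
        self.l1 = sorted(merged.items())
        self.l0 = []


db = ToyLSM(memtable_limit=2)
for k, v in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    db.put(k, v)
l0_before = len(db.l0)   # two flushes happened, so two L0 runs
db.compact()             # the overwritten ("a", 1) is dropped during the merge
```

After compact(), L0 is empty and L1 holds one entry per key with the newest value, which is the "fewer, larger, non-overlapping files" effect in miniature.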

The catch is read amplification: a point lookup may have to check every L0 file (they have overlapping key ranges) plus one file per deeper level. Bloom filters skip files that can't contain the key, but more files still means slower reads. Compaction keeps file count and read amplification bounded — so when it falls behind, the engine throttles writes rather than let reads rot.
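To make the lookup cost concrete, here is a small sketch that counts how many files a point lookup touches. The function name files_checked and the data layout (dicts for L0 runs, (min_key, max_key, dict) tuples for deeper levels) are assumptions for illustration, and bloom filters are deliberately left out, so the count is the worst case before any filter skips.

```python
def files_checked(key, l0_runs, deeper_levels):
    """Return (value, files_consulted) for a point lookup, newest data first.

    l0_runs: list of dicts, oldest first; key ranges may overlap.
    deeper_levels: list of levels, each a list of (min_key, max_key, dict)
                   files whose key ranges do not overlap within the level.
    """
    checked = 0
    for run in reversed(l0_runs):        # any L0 file might hold the key
        checked += 1
        if key in run:
            return run[key], checked
    for level in deeper_levels:
        for lo, hi, data in level:       # at most one file per level covers the key
            if lo <= key <= hi:
                checked += 1
                if key in data:
                    return data[key], checked
                break
    return None, checked


l0 = [{"a": 1, "m": 2}, {"b": 3, "z": 4}, {"c": 5}]
l1 = [("a", "k", {"a": 0, "k": 9}), ("l", "z", {"q": 7})]

hit_new = files_checked("c", l0, [l1])    # found in the newest L0 run
hit_deep = files_checked("q", l0, [l1])   # all three L0 runs, then one L1 file
miss = files_checked("x", l0, [l1])       # same cost as the deep hit, no result
```

Every extra L0 run adds one more file to the deep-hit and miss cases, which is why a growing L0 shows up directly in read latency.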

RocksDB gates writes on the L0 file count with two triggers (level0_slowdown_writes_trigger and level0_stop_writes_trigger) and on the estimated compaction backlog with two byte limits (soft_pending_compaction_bytes_limit and hard_pending_compaction_bytes_limit):

# Illustrative gate, re-evaluated before admitting more writes.
# Defaults shown are RocksDB-style; tune per workload.
slowdown_trigger = 20          # level0_slowdown_writes_trigger
stop_trigger     = 36          # level0_stop_writes_trigger
soft_limit       = 64 << 30    # soft_pending_compaction_bytes_limit (64 GiB)
hard_limit       = 256 << 30   # hard_pending_compaction_bytes_limit (256 GiB)

def admit_write(db):
    """Return the admission state for db, an illustrative handle that
    exposes num_l0_files() and pending_compaction_bytes()."""
    n = db.num_l0_files()
    pending = db.pending_compaction_bytes()
    if n >= stop_trigger or pending >= hard_limit:
        return "stall"          # hard stall: foreground writes wait
    if n >= slowdown_trigger or pending >= soft_limit:
        return "slowdown"       # soft slowdown: delay each write a little
    return "ok"                 # healthy: write flows at full speed

(The visual above uses small thresholds — slowdown at L0 ≥ 8, stop at L0 ≥ 12 — so the pile-up and drain are easy to see in a few steps.)

Cost / signals

Factor	What it measures	How to read it
Write amplification	Bytes written to disk ÷ bytes the app wrote — data is rewritten each time it moves down a level.	Leveled compaction can be 10–30× on write-heavy loads. High WA burns IO and SSD endurance.
Read amplification	SSTables consulted per read — all of `L0` plus one per deeper level (minus bloom-filter skips).	Climbs with `L0` file count. A bloated `L0` means both slow reads and an imminent stall.
Space amplification	On-disk bytes ÷ live bytes — overwrites and tombstones not yet merged away.	Grows when compaction lags or tombstones linger ("compaction debt"). Reclaimed only by merging.
Stall signal	p99 write latency, `L0` file count, and `pending_compaction_bytes` over time.	p99 spiking + `L0` climbing toward the stop trigger = compaction can't keep up. Watch pending bytes as the leading indicator.
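The two ratio-style rows in the table reduce to simple arithmetic. The sketch below assumes you can read four counters from your engine's statistics; the parameter names are hypothetical, and whether WAL bytes count toward disk writes is a choice you have to make consistently.

```python
def amplification(app_bytes_written, disk_bytes_written, disk_bytes_used, live_bytes):
    """Return (write_amp, space_amp) from four byte counters.

    write_amp = bytes written to disk / bytes the application wrote
    space_amp = bytes on disk / bytes of live (latest-version) data
    """
    write_amp = disk_bytes_written / app_bytes_written
    space_amp = disk_bytes_used / live_bytes
    return write_amp, space_amp


GiB = 1 << 30
# Example numbers only: 10 GiB of application writes caused 150 GiB of disk
# writes, and 10 GiB of live data occupies 13 GiB on disk.
write_amp, space_amp = amplification(10 * GiB, 150 * GiB, 13 * GiB, 10 * GiB)
```

With these example inputs the write amplification is 15 and the space amplification is 1.3. Tracking both over time matters more than any single reading.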

Watch out for

Too few compaction threads. If max_background_compactions (superseded by max_background_jobs in newer RocksDB releases) is undersized, a write burst outruns the mergers and you stall under load. Give compaction enough threads (and IO budget) to drain bursts.
Over-tuned compaction. Maxing out compaction aggressiveness burns disk IO and starves foreground reads/writes of bandwidth. Tune for headroom, don't pin it to the ceiling.
Compaction debt from tombstones / TTL. Deleted or expired data only disappears when a compaction rewrites the file. If those compactions never get scheduled, space and read amplification keep growing even though the data is logically gone.
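Bulk load into L0. Dumping many files straight into L0 (or a flood of flushes) can blow past the stop trigger instantly and stall the very next write. Where the engine supports it, ingest pre-sorted files directly into a lower level (RocksDB's IngestExternalFile tries to place each file in the lowest level it can), or rate-limit the load.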
Treating the stall as the only signal. By the time writes are blocked, you're already in trouble. Alert on pending_compaction_bytes growing and p99 write latency creeping up — those lead the stall.
Tiered vs leveled tradeoff. Tiered (size-tiered) compaction has lower write amplification but higher read and space amplification; leveled is the opposite. Pick to match read- vs write-heavy workloads.
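For the advice about alerting before the stall, one possible leading-indicator check is to extrapolate the growth of pending compaction bytes. This is a rough sketch under stated assumptions: the function name compaction_alert, the sample format, and the straight-line extrapolation are all illustrative choices, not part of any engine's API.

```python
def compaction_alert(samples, soft_limit, horizon_s):
    """Return True if pending compaction bytes look set to reach soft_limit
    within horizon_s seconds.

    samples: list of (timestamp_s, pending_compaction_bytes), oldest first.
    Growth is estimated as a straight line from the first to the last sample.
    """
    (t0, b0), (t1, b1) = samples[0], samples[-1]
    if b1 >= soft_limit:
        return True                      # already at or past the limit
    if t1 <= t0 or b1 <= b0:
        return False                     # flat or shrinking backlog
    rate = (b1 - b0) / (t1 - t0)         # bytes per second
    return (soft_limit - b1) / rate <= horizon_s


GiB = 1 << 30
growing = [(0, 10 * GiB), (60, 16 * GiB)]    # +6 GiB per minute
draining = [(0, 16 * GiB), (60, 10 * GiB)]

alert_10min = compaction_alert(growing, 64 * GiB, horizon_s=600)   # about 480 s away
alert_5min = compaction_alert(growing, 64 * GiB, horizon_s=300)
alert_drain = compaction_alert(draining, 64 * GiB, horizon_s=600)
```

A real alert would smooth over more than two points and combine this with p99 write latency, but the idea is the same: fire on the trend, not on the stall.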

Worked example

Start steady: L0 = 4 files, writes flowing. A burst arrives and flushes outrun compaction.

L0=4   flowing     write burst begins, flushes land in L0
L0=8   slowed      crossed slowdown trigger (8) -> each write delayed
L0=12  STALLED     crossed stop trigger (12) -> new writes blocked
   ...  compaction merges L0 files down into L1...
L0=3   flowing     backlog drained below triggers -> writes resume

When L0 reaches 8, the slowdown trigger in this example, the engine starts adding a small delay to each write (soft slowdown) and p99 write latency ticks up. At 12 it hits the stop trigger and blocks new writes entirely; the foreground sees a write stall. Background compaction keeps merging L0 SSTables into L1. Once the count drops below 12, writes are admitted again but still delayed, and once it falls below 8 (here it drains to 3) the engine resumes admitting writes at full speed.
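The same sequence can be reproduced with a few lines of simulation. This is a sketch with invented dynamics: the thresholds 8 and 12 come from the example, but the rule that a slowed engine accepts half the offered flushes and a stalled engine accepts none, and the fixed compaction rate of 3 files per tick, are assumptions chosen to make the numbers line up.

```python
SLOWDOWN, STOP = 8, 12   # the small thresholds used in the worked example


def state(l0):
    """Classify the write path from the L0 file count."""
    if l0 >= STOP:
        return "STALLED"
    if l0 >= SLOWDOWN:
        return "slowed"
    return "flowing"


def simulate(l0, offered_flushes, compacted_per_tick):
    """Return a list of (l0, state), one entry per tick.

    offered_flushes[t] is how many L0 files the workload would create at
    tick t at full speed. Hypothetical throttling: half of that when
    slowed, none when stalled.
    """
    trace = []
    for offered in offered_flushes:
        s = state(l0)
        trace.append((l0, s))
        accepted = {"flowing": offered, "slowed": offered // 2, "STALLED": 0}[s]
        l0 = max(0, l0 + accepted - compacted_per_tick)
    return trace


# A burst for three ticks, then the workload goes quiet.
trace = simulate(l0=4, offered_flushes=[7, 14, 14, 0, 0, 0], compacted_per_tick=3)
for l0, s in trace:
    print(f"L0={l0:<3} {s}")
```

The trace passes through 4 (flowing), 8 (slowed), 12 (STALLED), then drains through 9 and 6 down to 3 (flowing), matching the shape of the walkthrough above.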

Check yourself

1. Your p99 write latency just spiked and the L0 SSTable count is climbing toward the stop trigger. What's the most likely cause?

2. You switch from leveled to size-tiered (tiered) compaction to cut IO. What gets worse as a result?