Writes are cheap appends; the cost is paid later, when one read has to dig through many layers — until compaction tidies up.
A log-structured store makes writes cheap. Every write is first appended to a write-ahead log (WAL) on disk and fsync'd for durability, then applied to an in-memory memtable. When the memtable fills, it flushes to an immutable on-disk SSTable.
Deletes don't erase anything — they append a tombstone, a marker that shadows older values living in lower layers. Over time a single key's versions spread across many SSTables, so a read has to scan newest to oldest until it finds a value or a tombstone. That extra work per logical read is read amplification. Compaction merges SSTables, keeps only the newest version per key, drops tombstones once nothing older can hide beneath them, and so reclaims space and shrinks the read cost.
The whole store is append-only on the write path. put and delete never modify existing data; they add a newer record that shadows the old one. get walks from the freshest place (the memtable) toward the oldest layer and stops at the very first record it finds for the key — even if that record is a tombstone, which means "deleted".
def put(key, value):
wal.append((key, value)); wal.fsync() # durable first
memtable[key] = value # then in memory
def delete(key):
wal.append((key, TOMBSTONE)); wal.fsync()
memtable[key] = TOMBSTONE # a marker, not an erase
def get(key):
# newest source first: memtable, then SSTables newest -> oldest
for src in [memtable] + sstables_newest_first():
if key in src: # one disk read per SSTable
rec = src[key]
return None if rec is TOMBSTONE else rec
return None # never written
def compact(runs):
# merge several runs, keep only the newest version per key
out = {}
for src in runs_newest_first(runs):
for key, rec in src:
if key not in out: # first seen == newest
out[key] = rec
# drop tombstones once no older data can hide beneath this merge
return {k: r for k, r in out.items() if r is not TOMBSTONE}
A bloom filter per SSTable lets get skip a layer that provably can't contain the key, cutting some disk reads — but it can't help once the key really does live low down.
| Operation | Cost |
|---|---|
| Write (put / delete) | O(1) amortized — one WAL append + fsync, one memtable insert |
| Read (get) | O(layers) worst case — memtable plus every SSTable until a hit; bloom filters trim, don't remove |
| Space | Old versions and tombstones linger until compaction reclaims them |
| Trade-off | Fast, durable, sequential writes now — read and space amplification paid back later, by I/O-heavy compaction |
fsync on every write trades latency for safety. Group commit / batched fsync amortizes it across many writes — at the cost of a small window where un-synced writes could be lost.Picture a queue table backed by an LSM store: rows are inserted at the tail and deleted from the head as they're processed. Each delete is a tombstone, not an erasure. After millions of items pass through, the "head" of the queue is buried under millions of tombstones for already-processed rows. A scan for the next live item — SELECT … LIMIT 1 at the head — must read past every one of those tombstones before reaching real data, and latency falls off a cliff.
Triggering compaction merges the runs, drops the obsolete rows and their now-unneeded tombstones, and collapses the head back to a thin layer. The same scan that crawled is fast again. This is exactly why delete-as-a-queue patterns are a known footgun on log-structured engines.
A DELETE just ran on a key. Is the row's data gone from disk?