Cleaning up the mess of append-only log files.
High-write databases (Cassandra, RocksDB, DynamoDB) use LSM Trees (Log-Structured Merge Trees). Instead of updating rows in place on disk (which is slow), they just append new data to small, immutable files called SSTables. If you update a user's score 5 times, there are 5 different records scattered across 5 files! This makes writes blazing fast, but reads become terribly slow. To fix this, a background process called Compaction continuously merges small SSTables into larger ones, throwing away the old, obsolete duplicate records.
Because SSTables are Sorted String Tables (the keys are in alphabetical order), Compaction is incredibly efficient. It doesn't need to load the whole files into RAM. It just streams them in using a classic Two-Pointer Merge Sort, dropping keys with older timestamps as it writes the new file to disk.
# The core logic of LSM Compaction
def compact(file1, file2, new_file):
# Both files are already sorted!
row1 = file1.read_next()
row2 = file2.read_next()
while row1 and row2:
if row1.key == row2.key:
# Duplicate found! Keep only the newest timestamp
winner = max(row1, row2, key=lambda r: r.timestamp)
new_file.write(winner)
row1, row2 = file1.read_next(), file2.read_next()
elif row1.key < row2.key:
new_file.write(row1)
row1 = file1.read_next()
else:
new_file.write(row2)
row2 = file2.read_next()
Compaction causes Write Amplification. A single 1KB update from a user might be rewritten to disk 5 or 6 times over its lifetime as it participates in larger and larger compactions. Under heavy load, Compaction can completely saturate the disk's IOPS, causing the database to briefly freeze (a "Compaction Stall") until the disk catches up.
Bob: DELETE). If you query Bob, the database sees the tombstone and returns nothing. Compaction is the only time tombstones are actually erased from the disk, freeing up physical space.