LSM Tree Compaction

Cleaning up the mess of append-only log files.

The idea

High-write databases (Cassandra, RocksDB, DynamoDB) use LSM Trees (Log-Structured Merge Trees). Instead of updating rows in place on disk (which is slow), they just append new data to small, immutable files called SSTables. If you update a user's score 5 times, there are 5 different records scattered across 5 files! This makes writes blazing fast, but reads become terribly slow. To fix this, a background process called Compaction continuously merges small SSTables into larger ones, throwing away the old, obsolete duplicate records.

Step 1: Two small SSTables exist on disk. Notice that Bob's score was updated from 10 to 50.

How it works (Merge Sort)

Because SSTables are Sorted String Tables (the keys are in alphabetical order), Compaction is incredibly efficient. It doesn't need to load the whole files into RAM. It just streams them in using a classic Two-Pointer Merge Sort, dropping keys with older timestamps as it writes the new file to disk.

# The core logic of LSM Compaction
def compact(file1, file2, new_file):
    # Both files are already sorted!
    row1 = file1.read_next()
    row2 = file2.read_next()
    
    while row1 and row2:
        if row1.key == row2.key:
            # Duplicate found! Keep only the newest timestamp
            winner = max(row1, row2, key=lambda r: r.timestamp)
            new_file.write(winner)
            row1, row2 = file1.read_next(), file2.read_next()
        elif row1.key < row2.key:
            new_file.write(row1)
            row1 = file1.read_next()
        else:
            new_file.write(row2)
            row2 = file2.read_next()

Cost

Compaction causes Write Amplification. A single 1KB update from a user might be rewritten to disk 5 or 6 times over its lifetime as it participates in larger and larger compactions. Under heavy load, Compaction can completely saturate the disk's IOPS, causing the database to briefly freeze (a "Compaction Stall") until the disk catches up.

Watch out for

Tombstones: How do you delete a row in an append-only system? You insert a "Tombstone" marker (e.g. Bob: DELETE). If you query Bob, the database sees the tombstone and returns nothing. Compaction is the only time tombstones are actually erased from the disk, freeing up physical space.