How modern databases achieve massive write throughput using LSM Trees.
Writing directly to disk randomly (like updating a B-Tree in place) is very slow. Modern databases (Cassandra, RocksDB, LevelDB) use Log-Structured Merge (LSM) Trees.
Writes first go to a fast in-memory structure called a Memtable. When it's full, it is flushed to disk as an immutable, sorted file (SSTable). Over time, many SSTables pile up. A background Compaction process merges these files, discarding old versions and tombstones (deleted items), creating new merged files.
def compact(sstable_1, sstable_2):
# Both are sorted lists of (key, timestamp, value)
# We do a linear merge, keeping only the newest value per key!
i, j = 0, 0
merged = []
while i < len(sstable_1) and j < len(sstable_2):
k1, t1, v1 = sstable_1[i]
k2, t2, v2 = sstable_2[j]
if k1 < k2:
merged.append((k1, t1, v1))
i += 1
elif k2 < k1:
merged.append((k2, t2, v2))
j += 1
else:
# Keys match! Keep the one with the higher timestamp
if t1 > t2:
merged.append((k1, t1, v1))
else:
merged.append((k2, t2, v2))
i += 1
j += 1
return merged