WAL Replay (Crash Recovery)

How a database survives having its power cord ripped out.

The idea

Updating table files in place on a hard drive is slow, so databases change your data in RAM (the Buffer Pool) first. But RAM is volatile: if the power goes out, everything in it is lost! To fix this, databases use a Write-Ahead Log (WAL). Before saying "Commit Success" to the user, the database appends a record of the change to a simple, sequential log file and flushes it to the hard drive. Appending to the end of a file is much cheaper than updating table pages scattered across the disk. If the power cord is pulled, the changes held only in RAM are lost, but on restart the database reads the WAL and "replays" the logged changes, so every committed transaction is restored.

Step 1: The App sends an UPDATE. The DB writes it to the WAL on disk FIRST.
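
The log-first rule and the replay step can be sketched in a few lines of Python. This is a toy key-value store, not how any real database lays out its WAL: the class name TinyWAL, the JSON-lines record format, and the "COMMIT" return value are all assumptions made for illustration.

```python
import json
import os
import tempfile


class TinyWAL:
    """Toy key-value store that logs every change before applying it."""

    def __init__(self, path):
        self.path = path
        self.memory = {}   # stands in for the buffer pool (RAM)
        self.replay()      # rebuild RAM from whatever survived on disk
        self.log = open(path, "a", encoding="utf-8")

    def replay(self):
        """Re-apply logged changes oldest first; drop a torn final record."""
        if not os.path.exists(self.path):
            return
        good_bytes = 0
        with open(self.path, "rb") as f:
            for line in f:
                try:
                    if not line.endswith(b"\n"):
                        raise ValueError("torn record")
                    record = json.loads(line)
                except ValueError:
                    break  # half-written record: it was never acknowledged
                self.memory[record["key"]] = record["value"]
                good_bytes += len(line)
        os.truncate(self.path, good_bytes)  # cut off the torn tail, if any

    def commit(self, key, value):
        # 1. Append the change to the log and force it onto the disk.
        self.log.write(json.dumps({"key": key, "value": value}) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        # 2. Only then update RAM and report success.
        self.memory[key] = value
        return "COMMIT"


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "wal.log")
    db = TinyWAL(path)
    db.commit("user:1:score", 100)
    db.log.close()
    del db                    # "crash": the in-memory dict is gone
    recovered = TinyWAL(path)
    print(recovered.memory)   # {'user:1:score': 100}
```

The os.fsync call is the important part: without it the record may still be sitting in an operating system cache when the power goes out, and the "success" reported to the user would be a lie.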

How it works (Checkpoints)

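If a database ran for 5 years without ever trimming its WAL, the log would grow without bound and crash recovery would have to replay all of it. To prevent this, the database periodically runs a Checkpoint. It takes every modified ("dirty") page currently in RAM, safely writes it to the permanent Table files on disk, and then deletes or recycles the WAL written before the Checkpoint, since those changes are now stored in the tables. Now, if the database crashes, it only has to replay the WAL from the last Checkpoint.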
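
A minimal sketch of the checkpoint idea, under heavy simplification: the "table" here is a single JSON file holding the whole state, whereas a real database flushes individual dirty pages. The function names log_change, checkpoint, and recover and the file layout are hypothetical.

```python
import json
import os
import tempfile


def log_change(wal_path, key, value):
    """Append one change to the WAL and force it to disk."""
    with open(wal_path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"key": key, "value": value}) + "\n")
        f.flush()
        os.fsync(f.fileno())


def checkpoint(memory, table_path, wal_path):
    """Persist the in-memory state, then discard the WAL it makes redundant."""
    tmp_path = table_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(memory, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, table_path)  # swap in the new table file in one step
    # Only after the table is safely on disk may the old log be emptied.
    # A crash before this line is harmless: replaying the old records
    # just sets the same values again.
    with open(wal_path, "w", encoding="utf-8") as f:
        f.flush()
        os.fsync(f.fileno())


def recover(table_path, wal_path):
    """Load the last checkpoint, then replay only the WAL written since."""
    memory, replayed = {}, 0
    if os.path.exists(table_path):
        with open(table_path, encoding="utf-8") as f:
            memory = json.load(f)
    if os.path.exists(wal_path):
        with open(wal_path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                memory[record["key"]] = record["value"]
                replayed += 1
    return memory, replayed


if __name__ == "__main__":
    d = tempfile.mkdtemp()
    table, wal = os.path.join(d, "table.json"), os.path.join(d, "wal.log")
    ram = {}
    for key, value in [("a", 1), ("b", 2), ("c", 3)]:
        log_change(wal, key, value)
        ram[key] = value
    checkpoint(ram, table, wal)
    log_change(wal, "a", 9)        # one change after the checkpoint
    print(recover(table, wal))     # ({'a': 9, 'b': 2, 'c': 3}, 1)
```

Four changes were made in total, but recovery replays only the one record written after the checkpoint.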

-- The Lifecycle of a Transaction
1. App: "UPDATE users SET score = 100 WHERE id = 1;"
2. DB: Appends "[id=1, score=100]" to the WAL and flushes it (Disk).
3. DB: Updates the row in the Buffer Pool (RAM).
4. DB: Responds "Commit Success" to App.

-- (Later...)
5. DB Background Thread: Writes the modified Buffer Pool pages to the actual Tables (Disk).

Cost

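Sequential disk writes (WAL) are much faster than random disk writes (updating actual table B-Trees): figures of 10x to 100x are typical for spinning disks, and the gap is smaller on SSDs. The price is that every commit has to wait until its WAL record is flushed, so commit latency is tied to how quickly the WAL disk can sync. This is why the WAL directory (pg_wal in PostgreSQL, the redo logs in MySQL and Oracle) is often placed on fast, dedicated SSDs. Capacity matters too: during a massive spike in traffic, the DB can generate WAL faster than Checkpoints can retire it, and the WAL directory can fill up the disk.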
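
To get a feel for how quickly a traffic spike eats WAL space, here is a back-of-the-envelope calculation. The numbers (50 MB/s of WAL, a checkpoint every 5 minutes, 100,000 MB free) are made up for illustration, and the function names are hypothetical; real databases also trigger checkpoints by WAL size, so treat this as a rough upper bound.

```python
def wal_between_checkpoints(wal_mb_per_s, checkpoint_interval_s):
    """WAL (in MB) that piles up before the next checkpoint can retire it."""
    return wal_mb_per_s * checkpoint_interval_s


def seconds_until_full(free_mb, wal_mb_per_s):
    """How long the WAL volume lasts if no WAL is retired at all."""
    if wal_mb_per_s <= 0:
        raise ValueError("wal_mb_per_s must be positive")
    return free_mb / wal_mb_per_s


if __name__ == "__main__":
    # Hypothetical spike: 50 MB/s of WAL, checkpoints every 300 seconds,
    # 100,000 MB free on the WAL volume.
    print(wal_between_checkpoints(50, 300))  # 15000 (MB)
    print(seconds_until_full(100_000, 50))   # 2000.0 (seconds, about 33 minutes)
```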

Watch out for