Durability

A write isn’t really safe until it has survived the fsync — buffers can vanish in a power cut.

The idea

Durability is the D in ACID: once you tell a client “your write committed,” it must survive a crash or power loss. The trap is that the layers between your code and the platter lie — write() returns the moment the bytes land in the OS page cache, long before they reach stable storage.

The fix is a write-ahead log (WAL): append the change to a log file and fsync it to disk before you acknowledge the client. fsync is the one call that forces buffered bytes through to durable media. Cross-machine durability comes from replication — ack only after N replicas hold the record.

See it work

A write request is about to begin. Press play, or step through it — and try a crash at any stage.

How it works

The WAL is append-only, so writes are sequential — fast even on spinning disks. The rule is simple: append the record, fsync the log, then ack. fsync (or fdatasync) is what actually forces the buffered bytes onto stable media. On restart, recovery replays the WAL to re-apply any committed records that hadn’t yet been folded into the main data files (checkpointed). To amortise the fsync cost, databases use group commit: batch many pending writes into one flush.

# commit path — durable before we ack
def commit(record):
    append(wal, record)      # bytes now in the OS page cache (volatile!)
    fsync(wal)               # forces them to stable storage — blocks here
    ack(client)              # only now is it safe to say "committed"

# recovery on restart — replay the log
def recover():
    for rec in read(wal):
        if not already_applied(rec):
            apply(rec)       # committed-but-not-checkpointed writes restored

For machine failure (not just process/power loss), replicate: ship the record to followers and ack only once N of them have it on durable storage. Note the subtle layer below fsync — a disk’s own write cache can still hold data in volatile RAM unless write barriers / FUA are honoured.

Cost

Signal	What it costs / buys you
fsync latency	The durability tax. SSD ~tens of µs to a few ms; HDD seek+rotate is ms. Every commit pays it unless batched.
Throughput vs durability	Group commit / fewer flushes → higher throughput, but a bigger window of writes that vanish on a crash.
Durability scope	A single-node fsync survives process crash, OS crash and power loss. It does not survive that machine dying — replication does.
Missing fsync	Acking from the page cache is fast and feels fine in testing — until a power cut silently drops acknowledged writes.

Watch out for

Acking before fsync. Returning success from the page cache is fast, but a power loss drops those committed writes. This is the classic post-mortem: “we lost data after the outage.”
fsync vs fdatasync — and the directory. fdatasync skips metadata; fine for an existing file’s data. But for a newly created file you must also fsync the containing directory, or the file may not exist after a crash.
The disk lied. A drive or controller write cache can ignore fsync and ack from its own volatile RAM unless write barriers / FUA are honoured. Then even a “correct” fsync loses data on power loss.
Replication-to-memory isn’t durability. If all replicas only hold the write in RAM, a correlated power loss (whole rack, whole AZ) loses it. Something, somewhere, must have fsync’d it.
Swallowed fsync errors (“fsyncgate”). A failed fsync can drop the dirty data, and on some kernels the error doesn’t resurface on retry. Treat an fsync error as fatal for that data.

Worked example

A write reaches the page cache and power dies before fsync. The record is gone — but the client was never acked, so correctness holds: from its point of view the write never committed, and it can safely retry.

Re-run it. This time the record is appended and fsync returns, so it’s on stable storage. Power dies before the ack reaches the client. The client is in doubt (did it commit?), so it retries. On restart, recovery replays the WAL and the write is present. Because the write carries the same idempotency key, the retry is a no-op — no duplicate. Durability held; the client just couldn’t tell.

crash BEFORE fsync:  in page cache → LOST, but never acked  → safe to retry
crash AFTER  fsync:  on stable WAL → SURVIVES (replayed)    → in-doubt, retry idempotently
crash AFTER  ack:    on stable WAL → guaranteed present

Check yourself

Your database calls write() to its log file and immediately acks the client, planning to fsync later in a batch. The machine loses power before that batch fsync. What happens to the acked writes?

A leader replicates each write to two followers and acks once both have it in memory. None of the three has fsync’d yet when a power event takes down the whole rack. Was that write durable?