A write isn’t really safe until it has survived the fsync — buffers can vanish in a power cut.
Durability is the D in ACID: once you tell a client “your write committed,” it must survive a crash or power loss. The trap is that the layers between your code and the platter lie — write() returns the moment the bytes land in the OS page cache, long before they reach stable storage.
The fix is a write-ahead log (WAL): append the change to a log file and fsync it to disk before you acknowledge the client. fsync is the one call that forces buffered bytes through to durable media. Cross-machine durability comes from replication — ack only after N replicas hold the record.
The WAL is append-only, so writes are sequential — fast even on spinning disks. The rule is simple: append the record, fsync the log, then ack. fsync (or fdatasync) is what actually forces the buffered bytes onto stable media. On restart, recovery replays the WAL to re-apply any committed records that hadn’t yet been folded into the main data files (checkpointed). To amortise the fsync cost, databases use group commit: batch many pending writes into one flush.
# commit path — durable before we ack
def commit(record):
append(wal, record) # bytes now in the OS page cache (volatile!)
fsync(wal) # forces them to stable storage — blocks here
ack(client) # only now is it safe to say "committed"
# recovery on restart — replay the log
def recover():
for rec in read(wal):
if not already_applied(rec):
apply(rec) # committed-but-not-checkpointed writes restored
For machine failure (not just process/power loss), replicate: ship the record to followers and ack only once N of them have it on durable storage. Note the subtle layer below fsync — a disk’s own write cache can still hold data in volatile RAM unless write barriers / FUA are honoured.
| Signal | What it costs / buys you |
|---|---|
| fsync latency | The durability tax. SSD ~tens of µs to a few ms; HDD seek+rotate is ms. Every commit pays it unless batched. |
| Throughput vs durability | Group commit / fewer flushes → higher throughput, but a bigger window of writes that vanish on a crash. |
| Durability scope | A single-node fsync survives process crash, OS crash and power loss. It does not survive that machine dying — replication does. |
| Missing fsync | Acking from the page cache is fast and feels fine in testing — until a power cut silently drops acknowledged writes. |
fdatasync skips metadata; fine for an existing file’s data. But for a newly created file you must also fsync the containing directory, or the file may not exist after a crash.A write reaches the page cache and power dies before fsync. The record is gone — but the client was never acked, so correctness holds: from its point of view the write never committed, and it can safely retry.
Re-run it. This time the record is appended and fsync returns, so it’s on stable storage. Power dies before the ack reaches the client. The client is in doubt (did it commit?), so it retries. On restart, recovery replays the WAL and the write is present. Because the write carries the same idempotency key, the retry is a no-op — no duplicate. Durability held; the client just couldn’t tell.
crash BEFORE fsync: in page cache → LOST, but never acked → safe to retry
crash AFTER fsync: on stable WAL → SURVIVES (replayed) → in-doubt, retry idempotently
crash AFTER ack: on stable WAL → guaranteed present
Your database calls write() to its log file and immediately acks the client, planning to fsync later in a batch. The machine loses power before that batch fsync. What happens to the acked writes?
A leader replicates each write to two followers and acks once both have it in memory. None of the three has fsync’d yet when a power event takes down the whole rack. Was that write durable?