Flushing to disk (fsync and the page cache)

A successful write() only reaches the page cache — your data is durable only after fsync pushes those dirty pages down to the physical disk and waits for the device to confirm.

The idea

There are two buffering layers between your program and stable storage. First the app’s own userspace buffer; flush() moves that into the kernel with a write() syscall. Second the kernel page cache: a write() lands there as a dirty page and returns immediately.

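Dirty pages are visible to other readers, but they are not on the disk yet. fsync(fd) forces those dirty pages down to the device and blocks until the device acknowledges them. A power loss or kernel crash in the gap between write() and fsync can lose every page that was still dirty. A crash of the process alone does not lose them, because the page cache belongs to the kernel, but anything still sitting in the app's userspace buffer is gone.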
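
To make the two layers concrete, here is a minimal sketch of crossing both of them with a buffered Python file object. The helper name write_durably and the temporary file path are illustrative assumptions, not part of any API.

```python
import os
import tempfile


def write_durably(path: str, payload: bytes) -> None:
    """Append payload and return only after both buffering layers are crossed."""
    with open(path, "ab") as f:   # buffered file object: layer 1 lives in this process
        f.write(payload)          # may still sit in the userspace buffer
        f.flush()                 # layer 1 -> layer 2: issues write(), pages are now dirty
        os.fsync(f.fileno())      # layer 2 -> device: blocks until the device acknowledges


# Illustrative usage against a throwaway file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.log")
write_durably(demo_path, b"record-1\n")
```

Skipping flush() before os.fsync() is a common slip: fsync only sees what has already reached the kernel, so bytes left in the userspace buffer would not be covered.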

How it works

A write() copies bytes into the kernel page cache and returns — fast, but the pages are only marked dirty, not persisted. fsync(fd) is what asks the kernel to push every dirty page for that file down to the device and block until the device says it is on stable media. After fsync returns cleanly, the data survives a power loss.

import os

record = b"example record\n"   # placeholder payload for illustration

fd = os.open("data.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)

os.write(fd, record)   # bytes now in the kernel page cache, marked DIRTY
                       # readers can see them, but a power loss here loses them

os.fsync(fd)           # force the dirty pages to the device and WAIT for the
                       # device ack. only now is the record durable.

# fdatasync skips flushing inode metadata (mtime, etc.) when only the data
# matters, so it can be a little cheaper than a full fsync. Use it in place
# of os.fsync(fd) above, not in addition to it:
os.fdatasync(fd)
os.close(fd)

# a NEW file also needs its directory entry persisted, or the file itself
# can vanish on crash even though its data was fsynced:
dir_fd = os.open(os.path.dirname("data.log") or ".", os.O_RDONLY)
os.fsync(dir_fd)       # persist the create/rename in the parent directory
os.close(dir_fd)

fdatasync exists because a full fsync also flushes inode metadata such as the modification time. When only the file contents must be durable, fdatasync can skip metadata that is not needed to read the data back, though it still flushes essentials such as a changed file size. And after creating or renaming a file you must fsync the parent directory too; otherwise the directory entry can be lost on crash and the freshly written file disappears.
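
The create-or-rename rule is often packaged as a "write to a temporary file, fsync, rename, fsync the directory" helper. The sketch below is one way to do that on a POSIX system. The function name durable_replace and the ".tmp" suffix are assumptions made for illustration, and opening a directory with os.open is not supported on Windows.

```python
import os
import tempfile


def durable_replace(path: str, data: bytes) -> None:
    """Replace path with data so the new contents survive a crash after return."""
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"  # hypothetical temp-file naming convention

    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:                        # os.write may write fewer bytes than asked
            view = view[os.write(fd, view):]
        os.fsync(fd)                       # file contents durable before the rename
    finally:
        os.close(fd)

    os.replace(tmp_path, path)             # atomically switch the name to the new file

    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)                   # persist the rename in the parent directory
    finally:
        os.close(dir_fd)


# Illustrative usage against a throwaway directory.
target = os.path.join(tempfile.mkdtemp(), "settings.bin")
durable_replace(target, b"v1")
durable_replace(target, b"version-2")
```

The order matters: syncing the file before the rename means a crash never exposes a name that points at half-written contents.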

Cost / trade-offs

Property	`write()`	`fsync(fd)`
Guarantees	Bytes copied into the page cache; visible to readers	Dirty pages on stable media; device has acknowledged
Latency	Fast: a memory copy, microseconds	Slow: device writeback + cache flush + ack, often milliseconds (a seek too on spinning disks)
Durability	None — lost on power failure	Yes — survives power loss once it returns
Batching	Cheap to call per record	Amortise it: one fsync after a batch (group commit)
Crash behaviour	Un-synced pages vanish	Everything synced before the crash remains

Because fsync is the expensive step, durable systems batch it: many write() calls, then a single fsync — this is group commit. Write-ahead logging and filesystem journals rely on fsync ordering: the log record must be durable before the change it describes is applied, or recovery cannot trust the log.
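
A minimal sketch of group commit follows, assuming a single-threaded writer and a fixed batch size. The class name GroupCommitLog, the batch_size parameter, and the sync_count counter are all illustrative, not taken from any real library.

```python
import os
import tempfile


class GroupCommitLog:
    """Append-only log that fsyncs once per batch instead of once per record."""

    def __init__(self, path: str, batch_size: int = 64) -> None:
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.batch_size = batch_size
        self.pending = 0      # records written since the last fsync
        self.sync_count = 0   # how many fsync calls were issued (for illustration)

    def append(self, record: bytes) -> None:
        view = memoryview(record)
        while view:                         # handle short writes
            view = view[os.write(self.fd, view):]
        self.pending += 1                   # in the page cache, NOT yet durable
        if self.pending >= self.batch_size:
            self.commit()

    def commit(self) -> None:
        """Make every pending record durable with a single fsync."""
        if self.pending:
            os.fsync(self.fd)
            self.sync_count += 1
            self.pending = 0

    def close(self) -> None:
        self.commit()                       # do not leave a partial batch unsynced
        os.close(self.fd)


# Illustrative usage: 1000 records cost 10 fsyncs instead of 1000.
log_path = os.path.join(tempfile.mkdtemp(), "group.log")
log = GroupCommitLog(log_path, batch_size=100)
for i in range(1000):
    log.append(b"record %d\n" % i)
log.close()
print(log.sync_count)  # prints 10
```

The trade-off is that a record is only durable once its batch commits, so callers that need an acknowledgement must wait for commit() rather than for append().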

Watch out for

Assuming write() means durable. A successful write() only reaches the page cache. The bytes are visible to readers but a power loss before fsync loses them. “The call returned” is not “the data is safe.”
Forgetting fsync entirely. Without it the kernel flushes dirty pages lazily on its own schedule (on Linux, typically seconds to tens of seconds later), so a crash can silently discard your most recent writes. Durability needs an explicit fsync at the commit point.
One fsync per record. Each fsync waits for a disk ack, so syncing every record caps you at the device’s sync rate. Batch many records and fsync once — group commit — to amortise the cost.
Not fsyncing the directory. After creating or renaming a file you must fsync the parent directory, or the directory entry can be lost and the file vanishes on crash even though its data was synced.
Lying disks and ignored errors. Some drives acknowledge a flush while data still sits in a volatile write cache, defeating fsync. And an fsync error must never be swallowed — the “fsyncgate” lesson is that a failed fsync can mark pages clean while the write never landed, so on failure you must treat the data as not durable.
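
The "never swallow an fsync error" rule can be sketched as a small wrapper that turns any write or fsync failure into an explicit "not durable" signal. NotDurableError and append_or_fail are hypothetical names for this example. What the caller does next (abort, fail over, rewrite the data from its own copy) is a policy choice this sketch does not make.

```python
import os
import tempfile


class NotDurableError(Exception):
    """Raised when we cannot prove the bytes reached stable storage."""


def append_or_fail(fd: int, record: bytes) -> None:
    """Write record and fsync it; raise NotDurableError if either step fails."""
    view = memoryview(record)
    try:
        while view:                         # handle short writes
            view = view[os.write(fd, view):]
        os.fsync(fd)
    except OSError as exc:
        # Do not just retry fsync and trust a later success: the kernel may
        # already have marked the failed pages clean, so the retry can
        # succeed without the data ever reaching the device.
        raise NotDurableError(f"record not durable: {exc}") from exc


# Illustrative usage against a throwaway file.
err_path = os.path.join(tempfile.mkdtemp(), "err.log")
err_fd = os.open(err_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
append_or_fail(err_fd, b"ok\n")
os.close(err_fd)
```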

Worked example

A database commits a transaction through a write-ahead log. It appends the commit record to the WAL, then fsyncs the WAL before it acknowledges the commit to the client. That ordering is the whole guarantee.

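# commit path (wal_fd, txn, txn_records and ack_to_client come from the caller;
# COMMIT_MARKER is a placeholder for whatever commit record the WAL uses)
for rec in txn_records:
    os.write(wal_fd, rec)          # write() the log entries -> page cache (dirty)
os.write(wal_fd, COMMIT_MARKER)    # commit marker, also still only in the page cache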

os.fsync(wal_fd)           # block until the WAL is durable on disk

ack_to_client(txn)         # only now: the client is told "committed"

# --- crash + restart ---
# Replay the WAL: every txn whose COMMIT marker was fsynced is re-applied.
# A txn still in the page cache (not yet fsynced, not yet acked) is simply
# dropped -- the client never heard "committed", so losing it is correct.

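The contract holds because the ack comes after the fsync. A crash can only lose transactions that were never acknowledged, and recovery replays every transaction the client was promised. It may also replay one whose COMMIT marker reached disk just before the crash but was never acknowledged, which is safe because the client was never told it failed. The data-page changes can be written lazily afterwards, because the durable WAL is what makes them safe.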
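
The recovery side can be sketched the same way. This example assumes a made-up text WAL in which each line is "<txn_id> <operation>" and a commit is the line "<txn_id> COMMIT". Real logs use binary records with checksums to detect a torn tail, which this sketch leaves out.

```python
def replay(wal_lines):
    """Return the operations of every transaction whose COMMIT marker is in the log."""
    pending = {}   # txn_id -> operations seen so far, not yet committed
    applied = []   # operations to re-apply, in commit order
    for line in wal_lines:
        txn_id, _, body = line.rstrip("\n").partition(" ")
        if body == "COMMIT":
            applied.extend(pending.pop(txn_id, []))
        else:
            pending.setdefault(txn_id, []).append(body)
    # Anything left in pending never committed, so it is dropped.
    return applied


# Illustrative log: t1 committed, t2 was cut off by the crash.
surviving_log = [
    "t1 set a=1\n",
    "t2 set b=2\n",
    "t1 set c=3\n",
    "t1 COMMIT\n",
    "t2 set d=4\n",
]
print(replay(surviving_log))  # prints ['set a=1', 'set c=3']
```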

Check yourself

Your write() returned success, then the power dropped before any fsync. Is that record safe?

Your log writer calls fsync after every single record and throughput is terrible. What is the standard fix?