Flushing to disk (fsync and the page cache)

A successful write() only reaches the page cache — your data is durable only after fsync pushes those dirty pages down to the physical disk and waits for the device to confirm.

The idea

There are two buffering layers between your program and stable storage. The first is the app’s own userspace buffer: flush() moves its contents into the kernel with a write() syscall. The second is the kernel page cache: a write() lands there as a dirty page and returns without waiting for the disk.

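Dirty pages are visible to other readers, but they are not on the disk yet. The kernel writes them back on its own schedule; fsync(fd) is how you force those dirty pages down to the device and block until the device acknowledges. A power loss or kernel crash in the gap between write() and fsync loses every page that was still dirty. A crash of the process alone does not: the page cache belongs to the kernel, so data already handed over by write() is still written back.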
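The two layers can be seen in a few lines of Python. This is a minimal sketch, assuming a POSIX-like system; the helper name durable_append and the scratch file path are made up for illustration.

```python
import os
import tempfile

def durable_append(path, payload):
    """Append payload to path and return only after it has been fsynced."""
    with open(path, "ab") as f:
        f.write(payload)        # layer 1: bytes sit in Python's userspace buffer
        f.flush()               # write() syscall: bytes move into the kernel page cache (dirty)
        os.fsync(f.fileno())    # layer 2: kernel pushes the dirty pages to the device and waits

if __name__ == "__main__":
    # Hypothetical scratch file, used only for this demonstration.
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.log")
    durable_append(demo_path, b"record-1\n")
    durable_append(demo_path, b"record-2\n")
    with open(demo_path, "rb") as f:
        print(f.read())
```

The order matters: os.fsync only covers bytes the kernel already has, so calling it without the preceding flush() would sync the file while the newest bytes were still sitting in the userspace buffer.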


How it works

A write() copies bytes into the kernel page cache and returns — fast, but the pages are only marked dirty, not persisted. fsync(fd) is what asks the kernel to push every dirty page for that file down to the device and block until the device says it is on stable media. After fsync returns cleanly, the data survives a power loss.

import os

record = b"example record\n"   # placeholder payload for illustration

fd = os.open("data.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)

os.write(fd, record)   # bytes now in the kernel page cache, marked DIRTY
                       # readers can see them, but a power loss here loses them

os.fsync(fd)           # force the dirty pages to the device and WAIT for the
                       # device ack. only now is the record durable.

# fdatasync is the cheaper alternative when only the data matters: it skips
# inode metadata (mtime, etc.) that is not needed to read the data back.
# (Unix only, and not available on macOS.)
os.fdatasync(fd)
os.close(fd)

# a NEW file also needs its directory entry persisted, or the file itself
# can vanish on crash even though its data was fsynced:
dir_fd = os.open(os.path.dirname("data.log") or ".", os.O_RDONLY)
os.fsync(dir_fd)       # persist the create/rename in the parent directory
os.close(dir_fd)

fdatasync exists because a full fsync also flushes inode metadata such as the modification time; when only the file contents must be durable, fdatasync can skip that extra metadata write. It still flushes any metadata needed to read the data back, such as a changed file size. And after creating or renaming a file you must fsync the parent directory too; otherwise the directory entry can be lost on crash and the freshly written file disappears.
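The create-or-rename rule is commonly combined into a "write a temp file, fsync it, rename it, fsync the directory" pattern. The following is a sketch of that pattern under the assumption of a POSIX filesystem where a directory can be opened read-only and fsynced; atomic_write and the ".tmp-" prefix are illustrative names, not part of any library.

```python
import os
import tempfile

def atomic_write(path, data):
    """Replace path with data so a crash leaves either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()                 # userspace buffer -> page cache
            os.fsync(f.fileno())      # contents durable BEFORE the rename
        os.replace(tmp_path, path)    # atomic rename over the target name
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)       # do not leave a stray temp file behind
        raise
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)              # persist the new directory entry
    finally:
        os.close(dir_fd)

if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "config.json")  # hypothetical file
    atomic_write(target, b'{"version": 1}')
    atomic_write(target, b'{"version": 2}')
    with open(target, "rb") as f:
        print(f.read())
```

Syncing the file before the rename is the important ordering: if the rename became durable first, a crash could leave the new name pointing at empty or partial contents.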

Cost / trade-offs

Property | write() | fsync(fd)
Guarantees | Bytes copied into the page cache; visible to readers | Dirty pages on stable media; device has acknowledged
Latency | Fast: a memory copy, microseconds | Slow: a real device write + flush + ack (plus a seek on spinning disks), often milliseconds
Durability | None: lost on power failure | Yes: survives power loss once it returns
Batching | Cheap to call per record | Amortise it: one fsync after a batch (group commit)
Crash behaviour | Un-synced pages vanish | Everything synced before the crash remains

Because fsync is the expensive step, durable systems batch it: many write() calls, then a single fsync — this is group commit. Write-ahead logging and filesystem journals rely on fsync ordering: the log record must be durable before the change it describes is applied, or recovery cannot trust the log.
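The batching idea can be sketched as a small log writer that fsyncs once per batch rather than once per record. This is a single-threaded simplification (real group commit usually batches several concurrent committers); the class name BatchedLog, the batch_size parameter, and the fsync_calls counter are assumptions made for illustration.

```python
import os
import tempfile

class BatchedLog:
    """Append-only log that fsyncs once per batch instead of once per record."""

    def __init__(self, path, batch_size=100):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.batch_size = batch_size
        self.pending = 0        # records written but not yet fsynced
        self.fsync_calls = 0    # counted only to make the amortisation visible

    def append(self, record):
        os.write(self.fd, record)       # cheap: lands in the page cache
        self.pending += 1
        if self.pending >= self.batch_size:
            self.sync()

    def sync(self):
        """Make every pending record durable with a single fsync."""
        if self.pending:
            os.fsync(self.fd)
            self.fsync_calls += 1
            self.pending = 0

    def close(self):
        self.sync()                     # flush the final partial batch
        os.close(self.fd)

if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "batched.log")  # hypothetical file
    log = BatchedLog(path, batch_size=100)
    for i in range(1000):
        log.append(b"record %d\n" % i)
    log.close()
    print(log.fsync_calls)  # 10 fsyncs for 1000 records
```

The trade-off is that a record is not durable until the sync() that covers it has returned, so a caller must hold back any acknowledgement for the whole batch until then.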

Worked example

A database commits a transaction through a write-ahead log. It appends the commit record to the WAL, then fsyncs the WAL before it acknowledges the commit to the client. That ordering is the whole guarantee.

import os

COMMIT = b"COMMIT\n"                 # illustrative commit marker

# commit path (function and parameter names are illustrative)
def commit(wal_fd, txn_records, ack_to_client):
    for rec in txn_records:
        os.write(wal_fd, rec)        # write() the log entries -> page cache (dirty)
    os.write(wal_fd, COMMIT)         # commit marker, also still dirty

    os.fsync(wal_fd)                 # block until the WAL is durable on disk

    ack_to_client()                  # only now: the client is told "committed"

# --- crash + restart ---
# Replay the WAL: every txn whose COMMIT marker was fsynced is re-applied.
# A txn still in the page cache (not yet fsynced, not yet acked) is simply
# dropped -- the client never heard "committed", so losing it is correct.

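The contract holds because the ack comes after the fsync. A crash can only lose transactions that were never acknowledged, and recovery replays every transaction the client was promised. It may also replay one that became durable just before the crash but whose ack was never sent, which is harmless because the client was never told otherwise. The data-page changes can be written lazily afterwards; the durable WAL is what makes them safe.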
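The recovery side can be sketched as a replay function that applies only transactions whose commit marker made it into the log. The line format here ("<txn> SET <key> <value>" and "<txn> COMMIT") is invented for this example, and the sketch relies on newline framing alone; a real WAL would normally add checksums to detect torn or corrupted records.

```python
def replay(wal_bytes):
    """Rebuild state from WAL bytes, applying only transactions with a COMMIT line."""
    state = {}
    pending = {}   # txn_id -> list of (key, value) writes not yet committed
    lines = wal_bytes.split(b"\n")
    # The last element is either b"" (the log ended on a newline) or a torn,
    # partially written line; it is dropped either way.
    for line in lines[:-1]:
        parts = line.decode().split(" ")
        if len(parts) == 4 and parts[1] == "SET":
            pending.setdefault(parts[0], []).append((parts[2], parts[3]))
        elif len(parts) == 2 and parts[1] == "COMMIT":
            # Commit marker reached the log: apply this txn's buffered writes.
            for key, value in pending.pop(parts[0], []):
                state[key] = value
        # Any other line shape is ignored in this sketch.
    return state

if __name__ == "__main__":
    # t1 committed; t2 was cut off before its COMMIT marker was written.
    wal = b"t1 SET a 1\nt1 COMMIT\nt2 SET b 2\nt2 COMM"
    print(replay(wal))  # {'a': '1'}
```

Writes from t2 never reach the rebuilt state, which matches the contract: without a durable commit marker the client was never told "committed".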

Check yourself

Your write() returned success, then the power dropped before any fsync. Is that record safe?

Your log writer calls fsync after every single record and throughput is terrible. What is the standard fix?