A successful write() only reaches the page cache — your data is durable only after fsync pushes those dirty pages down to the physical disk and waits for the device to confirm.
There are two buffering layers between your program and stable storage. The first is the application’s own userspace buffer; flush() moves its contents into the kernel with a write() syscall. The second is the kernel page cache: a write() lands there as a dirty page and returns immediately.
Dirty pages are visible to other readers, but they are not on the disk yet. Only fsync(fd) forces those dirty pages down to stable storage and blocks until the device acknowledges. A crash in the gap between write() and fsync loses every page that was still dirty.
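The two layers can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: `durable_append` and the scratch file name are hypothetical, and the sketch assumes a platform where `os.fsync` is honoured by the storage device.

```python
import os
import tempfile


def durable_append(path: str, payload: bytes) -> None:
    """Append payload and return only after both buffering layers are flushed."""
    with open(path, "ab") as f:   # buffered file object: the userspace layer
        f.write(payload)          # bytes may still sit in the process's own buffer
        f.flush()                 # write() syscall: bytes move into the kernel page cache
        os.fsync(f.fileno())      # ask the kernel to push the dirty pages to the device


# Hypothetical scratch file, used only for illustration.
path = os.path.join(tempfile.mkdtemp(), "demo.log")
durable_append(path, b"hello\n")
```

Skipping `flush()` would leave the bytes in the process, so `fsync` would have nothing to push down; skipping `fsync` would leave them in the page cache.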
In the accompanying visualization, warm-coloured blocks are dirty pages (in the page cache, at risk) and green blocks are clean pages (on stable storage, durable). The crash control shows the difference: only fsynced data survives.
A write() copies bytes into the kernel page cache and returns — fast, but the pages are only marked dirty, not persisted. fsync(fd) is what asks the kernel to push every dirty page for that file down to the device and block until the device says it is on stable media. After fsync returns cleanly, the data survives a power loss.
import os

record = b"event-1\n"  # example payload (placeholder)
fd = os.open("data.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
os.write(fd, record)  # bytes now in the kernel page cache, marked DIRTY
                      # readers can see them, but a crash here loses them
os.fsync(fd)          # force the dirty pages to stable storage and WAIT for the
                      # device ack. only now is the record durable.
# fdatasync skips flushing inode metadata (mtime, etc.) when only the data
# matters, so it can be a little cheaper than a full fsync
# (Python does not expose os.fdatasync on macOS):
os.fdatasync(fd)
os.close(fd)
# a NEW file also needs its directory entry persisted, or the file itself
# can vanish on crash even though its data was fsynced:
dir_fd = os.open(os.path.dirname("data.log") or ".", os.O_RDONLY)
os.fsync(dir_fd)      # persist the create/rename in the parent directory
os.close(dir_fd)
fdatasync exists because a full fsync also flushes inode metadata such as the modification time. When only the file contents must be durable, fdatasync can skip metadata that is not needed to read the data back (a changed file size is still flushed). And after creating or renaming a file you must fsync the parent directory too; otherwise the directory entry can be lost on crash and the freshly written file disappears.
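One common way to combine these rules is the write-to-temp, fsync, rename, fsync-directory pattern. The sketch below assumes a POSIX system (opening a directory with `os.open` does not work on Windows); `fsync_dir` and `durable_replace` are hypothetical helper names, not part of any library.

```python
import os
import tempfile


def fsync_dir(dirpath: str) -> None:
    """Persist directory entries (creates and renames) in dirpath. POSIX only."""
    dir_fd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


def durable_replace(path: str, data: bytes) -> None:
    """Replace path with data so a crash leaves the old or the new file, never a mix."""
    dirpath = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dirpath, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())    # contents are durable before the rename
        os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)     # do not leave a stray temp file behind
        raise
    fsync_dir(dirpath)              # make the rename itself durable
```

Note that `tempfile.mkstemp` creates the file with owner-only permissions, so a real caller may want to adjust the mode before the rename.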
| Property | write() | fsync(fd) |
|---|---|---|
| Guarantees | Bytes copied into the page cache; visible to readers | Dirty pages on stable media; device has acknowledged |
| Latency | Fast: a memory copy, microseconds | Slow: device write, cache flush, and ack; often milliseconds, especially on spinning disks |
| Durability | None — lost on power failure | Yes — survives power loss once it returns |
| Batching | Cheap to call per record | Amortise it: one fsync after a batch (group commit) |
| Crash behaviour | Un-synced pages vanish | Everything synced before the crash remains |
Because fsync is the expensive step, durable systems batch it: many write() calls, then a single fsync — this is group commit. Write-ahead logging and filesystem journals rely on fsync ordering: the log record must be durable before the change it describes is applied, or recovery cannot trust the log.
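A minimal sketch of group commit, assuming a single-threaded writer and a fixed batch size. `GroupCommitLog`, `batch_size`, and `sync_count` are hypothetical names introduced for illustration; real systems usually also commit on a timer and coordinate several waiting writers.

```python
import os
import tempfile


class GroupCommitLog:
    """Append-only log that pays for one fsync per batch, not one per record."""

    def __init__(self, path: str, batch_size: int = 64) -> None:
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.batch_size = batch_size  # hypothetical tuning knob
        self.pending = 0              # records written but not yet fsynced
        self.sync_count = 0           # number of fsync calls issued so far

    def append(self, record: bytes) -> None:
        # Short writes are ignored here for brevity; small appends to a
        # regular file normally complete in one call.
        os.write(self.fd, record)
        self.pending += 1
        if self.pending >= self.batch_size:
            self.commit()

    def commit(self) -> None:
        """Make every pending record durable with a single fsync."""
        if self.pending:
            os.fsync(self.fd)
            self.sync_count += 1
            self.pending = 0

    def close(self) -> None:
        self.commit()
        os.close(self.fd)


log = GroupCommitLog(os.path.join(tempfile.mkdtemp(), "batch.log"), batch_size=10)
for i in range(25):
    log.append(f"record {i}\n".encode())
log.close()
print(log.sync_count)  # 3: two full batches of 10, then the last 5 on close
```

The trade-off is that a record is durable only once its batch has been committed, so callers should acknowledge records after `commit()` returns, not after `append()`.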
Assuming write() means durable. A successful write() only reaches the page cache. The bytes are visible to readers, but a power loss before fsync loses them. “The call returned” is not “the data is safe.”
Skipping fsync entirely. Without it the kernel flushes dirty pages lazily on its own schedule (seconds later), so a crash can roll your recent writes back to nothing. Durability needs an explicit fsync at the commit point.
Calling fsync per record. Each fsync waits for a disk ack, so syncing every record caps you at the device’s sync rate. Batch many records and fsync once (group commit) to amortise the cost.
Forgetting the directory. After creating or renaming a file, fsync the parent directory, or the directory entry can be lost and the file vanishes on crash even though its data was synced.
Ignoring errors from fsync. An fsync error must never be swallowed: the “fsyncgate” lesson is that a failed fsync can mark pages clean while the write never landed, so on failure you must treat the data as not durable.
A database commits a transaction through a write-ahead log. It appends the commit record to the WAL, then fsyncs the WAL before it acknowledges the commit to the client. That ordering is the whole guarantee.
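# commit path (wal_fd, txn_records, COMMIT_MARKER and ack_to_client are placeholders)
os.write(wal_fd, txn_records)    # write() the log entries -> page cache (dirty)
os.write(wal_fd, COMMIT_MARKER)  # the marker is also only in the page cache so far
os.fsync(wal_fd)                 # block until the WAL is durable on disk
ack_to_client(txn)               # only now: the client is told "committed"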
# --- crash + restart ---
# Replay the WAL: every txn whose COMMIT marker was fsynced is re-applied.
# A txn still in the page cache (not yet fsynced, not yet acked) is simply
# dropped -- the client never heard "committed", so losing it is correct.
The contract holds because the ack comes after the fsync. A crash can only lose transactions that were never acknowledged, and recovery replays exactly the ones the client was promised. The data-page changes can be written lazily afterwards — the durable WAL is what makes them safe.
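The commit and replay steps can be sketched end to end. This is a toy model under stated assumptions: the line-based record format (`SET` and `COMMIT` lines), the `commit` and `replay` helpers, and the rule that keys and values contain no spaces are all invented for illustration. Real write-ahead logs typically add checksums and sequence numbers to detect torn or corrupt records.

```python
import os
import tempfile


def commit(wal_fd: int, txn_id: int, updates: dict) -> None:
    """Append a transaction plus its COMMIT marker; return once it is durable."""
    lines = [f"SET {txn_id} {key} {value}\n" for key, value in updates.items()]
    lines.append(f"COMMIT {txn_id}\n")
    os.write(wal_fd, "".join(lines).encode())
    os.fsync(wal_fd)  # only after this returns may the caller ack the client


def replay(wal_path: str) -> dict:
    """Rebuild state from the WAL, applying only transactions with a COMMIT marker."""
    state = {}
    pending = {}  # txn_id -> updates seen so far without a COMMIT
    with open(wal_path, "rb") as f:
        data = f.read()
    # Drop whatever follows the last newline: it is either empty or a torn record.
    for raw in data.split(b"\n")[:-1]:
        parts = raw.decode().split(" ")
        if parts[0] == "SET" and len(parts) == 4:
            _, txn_id, key, value = parts
            pending.setdefault(txn_id, {})[key] = value
        elif parts[0] == "COMMIT" and len(parts) == 2:
            state.update(pending.pop(parts[1], {}))
    return state


wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal_fd = os.open(wal_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
commit(wal_fd, 1, {"balance": "100"})
os.write(wal_fd, b"SET 2 balance 50\n")  # simulated crash: txn 2 never got its COMMIT
os.close(wal_fd)
print(replay(wal_path))  # {'balance': '100'}
```

Transaction 2 is dropped on replay, which matches the contract: it was never fsynced with a COMMIT marker, so no client was ever told it had committed.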
Your write() returned success, then the power dropped before any fsync. Is that record safe?
Your log writer calls fsync after every single record and throughput is terrible. What is the standard fix?