Buffered I/O routes reads and writes through the kernel page cache so repeated accesses are fast; direct I/O skips the cache and talks straight to the disk.
When your program reads or writes a file, the data does not usually go straight to the platter. By default it passes through the kernel page cache, a region of RAM that holds recently touched file pages. A write() copies into the cache and returns immediately; a read() is served from the cache if the page is already there.
This buffered path makes repeats fast and batches writes, but the data is not durable the moment write() returns — it is only durable after a flush. Direct I/O (O_DIRECT) bypasses the cache entirely, moving bytes straight between your buffer and the disk. Databases choose it so they are not double-cached and can manage their own buffer pool.
Throughout, the fast or safe path means a cache hit, a clean page, or a write that has reached disk. The slow or unsafe path means a cache miss that must be read from disk, or a dirty page that is not yet durable.
A buffered write() copies your bytes into a page in the kernel page cache, marks that page dirty, and returns; at that point the data is in RAM, not on disk. The kernel writes dirty pages back later (on its own schedule, on fsync(), or under memory pressure). A buffered read() is a cache hit when the page is resident (served from RAM, no disk access) and a cache miss when it is not (the kernel reads the page from disk into the cache and then copies it to you, so the next read of that page hits).
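The hit/miss distinction can be observed roughly with the sketch below. It assumes Linux and Python 3.9+, and the helper names `timed_read` and `evict_from_page_cache` are invented for this illustration. `posix_fadvise(..., POSIX_FADV_DONTNEED)` is only a hint, so the kernel may keep some pages and the timings are indicative at best.

```python
import os
import tempfile
import time


def timed_read(path: str) -> tuple[bytes, float]:
    """Read the whole file with ordinary buffered I/O; return (data, seconds)."""
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        data = f.read()
    return data, time.perf_counter() - start


def evict_from_page_cache(path: str) -> None:
    """Ask the kernel to drop this file's cached pages (advisory, Linux)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)  # dirty pages cannot be dropped, so write them back first
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.bin")
        with open(path, "wb") as f:
            f.write(os.urandom(4 * 1024 * 1024))

        evict_from_page_cache(path)
        _, cold = timed_read(path)  # likely a miss: pages come from the device
        _, warm = timed_read(path)  # likely a hit: pages are already resident
        print(f"cold read: {cold:.6f} s, warm read: {warm:.6f} s")
```

On a fast SSD or a RAM-backed filesystem the two numbers can be close, so treat the output as a demonstration of the mechanism rather than a benchmark.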
Direct I/O opens the file with O_DIRECT. Reads and writes then DMA straight between your application buffer and the disk, skipping the page cache. There is no read-ahead and no write-back batching, but there is also no double-caching — which is exactly what a database wants when it runs its own buffer pool.
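import mmap
import os
payload = b"hello"  # example data (assumed); must fit in one 4096 B block below
# --- buffered (the default) ---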
fd = os.open("data.bin", os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, payload) # copies into the page cache, marks pages dirty, returns
os.fsync(fd) # NOW force the dirty pages to disk — only here is it durable
os.close(fd)
# --- direct I/O (bypass the page cache) ---
# O_DIRECT requires the buffer address, file offset, and length to be
# aligned to the device block size (often 512 B or 4096 B).
ALIGN = 4096
fd = os.open("data.bin", os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
buf = mmap.mmap(-1, ALIGN) # page-aligned buffer, length is a multiple of ALIGN
buf.write(payload.ljust(ALIGN, b"\x00"))
os.pwrite(fd, buf, 0) # DMA app buffer -> disk, no page cache
os.fsync(fd) # still fsync: O_DIRECT skips the cache, not the device write cache
os.close(fd)
Note the alignment: with O_DIRECT the buffer address, the file offset, and the transfer length must all be multiples of the device block size, or the call fails with EINVAL. And direct I/O still needs fsync() for full durability — bypassing the page cache is not the same as flushing the drive’s own write cache.
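The alignment rule above comes down to simple arithmetic, sketched here. The 4096 B block size is an assumption (real code should query the device), and `is_aligned`, `align_up`, and `aligned_buffer` are hypothetical helper names, not part of any library.

```python
import mmap

BLOCK = 4096  # assumed block size; the real value depends on the device


def is_aligned(value: int, block: int = BLOCK) -> bool:
    """True when an offset or length is a whole multiple of the block size."""
    return value % block == 0


def align_up(length: int, block: int = BLOCK) -> int:
    """Round a length up to the next multiple of the block size."""
    return (length + block - 1) // block * block


def aligned_buffer(payload: bytes, block: int = BLOCK) -> mmap.mmap:
    """Copy payload into a zero-filled anonymous mapping padded to a block multiple.

    Anonymous mappings start on a page boundary, which satisfies the address
    requirement as long as the block size does not exceed the page size.
    """
    size = align_up(max(len(payload), 1), block)
    buf = mmap.mmap(-1, size)
    buf[: len(payload)] = payload
    return buf
```

A 5000-byte payload, for example, needs an 8192-byte buffer, and the write must start at a file offset for which `is_aligned` is true.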
| Property | Buffered I/O | Direct I/O (O_DIRECT) |
|---|---|---|
| Read speed | Fast on cache hits (served from RAM); read-ahead helps streaming | Always touches disk — no cache speedup |
| Write latency | Low — write() returns once it is in the cache | Higher — waits on the device transfer |
| Durable on return | No — needs fsync(); a crash loses dirty pages | No — still needs fsync() for the device cache |
| Memory use | Page cache holds copies; risk of double-caching | No kernel copy — app owns its buffer pool |
| Alignment | None — any buffer, offset, length | Buffer, offset, length must be block-aligned |
| Typical user | General apps, build tools, anything sequential | Databases managing their own cache (InnoDB, many engines) |
Neither mode is durable the instant a write call returns. The page cache is a speed layer, not a persistence layer — fsync() is what makes data survive a power loss in both modes.
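A minimal sketch of a write that returns only after the flush, assuming a POSIX system. `durable_write` is a hypothetical helper name. In Python, `os.fsync()` reports failure by raising `OSError` instead of returning an error code, so letting the exception propagate plays the role of checking the return value.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write data to path and return only after fsync() has succeeded."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:  # os.write() may accept fewer bytes than requested
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)  # raises OSError if the flush fails
    finally:
        os.close(fd)
```

If `durable_write` raises, the caller should assume the data did not reach stable storage.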
write() is not durable until fsync(). It returns once the bytes are in the page cache, and a power loss before write-back drops every dirty page. If the data must survive a crash, call fsync() (or open with O_SYNC) and check its return value.
O_DIRECT needs alignment. The buffer address, file offset, and length must each be a multiple of the device block size (often 512 B or 4096 B). Pass an unaligned buffer and the call fails with EINVAL; allocate with posix_memalign or a page-aligned mmap.
Assuming that read() always hits the disk is wrong. A re-read may be served entirely from RAM. Drop caches (or use O_DIRECT) before measuring true disk throughput, or you will benchmark memory, not storage.
A database stores its tables in data files and runs its own buffer pool, a carefully tuned in-process cache of the hottest pages. If it read those files buffered, every hot page would sit in RAM twice: once in the buffer pool, once in the kernel page cache. So it opens the data files with O_DIRECT and lets its own cache be the single source of truth.
# data files: direct, so the kernel does not shadow the buffer pool
data_fd = os.open("table.dat",
os.O_RDWR | os.O_DIRECT, 0o644)
# the buffer pool reads and writes through aligned buffers
# (read_page, write_page, modify, and page_no stand in for the engine's own code)
page = read_page(data_fd, page_no) # DMA disk -> aligned buffer, no double-cache
modify(page)
write_page(data_fd, page_no, page) # DMA aligned buffer -> disk
# durability is still explicit — O_DIRECT skips the cache, not the commit
os.fsync(data_fd) # flush the device so the page survives a crash
Two ideas live together here: O_DIRECT avoids the wasted second copy, and fsync() still guarantees durability. Bypassing the page cache buys predictable, un-double-cached I/O; it does not, on its own, make a write safe across a power loss.
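One way the page helpers in such a design might look is sketched below. Everything here is an assumption for illustration: the 4096 B page size, the names `open_data_file`, `read_page`, and `write_page`, and the fallback to buffered I/O when the filesystem rejects O_DIRECT at open time. A real engine adds checksums, latching, and error recovery.

```python
import mmap
import os

PAGE = 4096  # assumed page size of the hypothetical data file


def open_data_file(path: str) -> int:
    """Open with O_DIRECT where available, else fall back to buffered I/O."""
    flags = os.O_RDWR | os.O_CREAT
    try:
        return os.open(path, flags | getattr(os, "O_DIRECT", 0), 0o644)
    except OSError:
        # Some filesystems refuse O_DIRECT when the file is opened.
        return os.open(path, flags, 0o644)


def read_page(fd: int, page_no: int) -> mmap.mmap:
    """Read one page into a fresh page-aligned buffer."""
    buf = mmap.mmap(-1, PAGE)  # anonymous mapping, starts on a page boundary
    got = os.preadv(fd, [buf], page_no * PAGE)
    if got != PAGE:
        raise OSError(f"short read: {got} of {PAGE} bytes")
    return buf


def write_page(fd: int, page_no: int, buf: mmap.mmap) -> None:
    """Write one full page at its aligned offset."""
    if len(buf) != PAGE:
        raise ValueError("buffer must be exactly one page")
    put = os.pwrite(fd, buf, page_no * PAGE)
    if put != PAGE:
        raise OSError(f"short write: {put} of {PAGE} bytes")
```

Both helpers keep the offset and length at multiples of `PAGE`, so the same code works whether or not the descriptor was opened with O_DIRECT. The caller still issues `os.fsync(fd)` after `write_page` when the page must survive a crash.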
A buffered write() just returned successfully. The machine loses power one millisecond later, before any fsync(). Is the data safe on disk?
Why does a database open its data files with O_DIRECT instead of reading them buffered?