File I/O: buffered vs direct

Buffered I/O routes reads and writes through the kernel page cache so repeats are fast; direct I/O skips the cache and talks straight to the disk.

The idea

When your program reads or writes a file, the data does not usually go straight to the storage device. By default it passes through the kernel page cache, a region of RAM that holds recently touched file pages. A write() copies into the cache and returns immediately; a read() is served from the cache if the page is already there.

This buffered path makes repeats fast and batches writes, but the data is not durable the moment write() returns — it is only durable after a flush. Direct I/O (O_DIRECT) bypasses the cache entirely, moving bytes straight between your buffer and the disk. Databases choose it so they are not double-cached and can manage their own buffer pool.

How it works

A buffered write() copies your bytes into a page in the kernel page cache, marks that page dirty, and returns; at that point the data is in RAM, not on disk. The kernel writes dirty pages back later (on its own schedule, on fsync(), or under memory pressure). A buffered read() is a cache hit when the page is resident (served from RAM, no disk touch) and a cache miss when it is not: the kernel reads the page from disk into the cache, copies it to your buffer, and the next read of that page hits.

Direct I/O opens the file with O_DIRECT. Reads and writes then DMA straight between your application buffer and the disk, skipping the page cache. There is no read-ahead and no write-back batching, but there is also no double-caching — which is exactly what a database wants when it runs its own buffer pool.

import mmap
import os

payload = b"example record"   # placeholder data for the two examples below

# --- buffered (the default) ---
fd = os.open("data.bin", os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, payload)   # copies into the page cache, marks pages dirty, returns
os.fsync(fd)            # NOW force the dirty pages to disk — only here is it durable
os.close(fd)

# --- direct I/O (bypass the page cache) ---
# O_DIRECT requires the buffer address, file offset, and length to be
# aligned to the device block size (often 512 B or 4096 B).
ALIGN = 4096
fd  = os.open("data.bin", os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
buf = mmap.mmap(-1, ALIGN)              # page-aligned buffer, length is a multiple of ALIGN
buf.write(payload.ljust(ALIGN, b"\x00"))
os.pwrite(fd, buf, 0)                   # DMA app buffer -> disk, no page cache
os.fsync(fd)                            # still fsync: O_DIRECT skips the cache, not the device write cache
os.close(fd)

Note the alignment: with O_DIRECT the buffer address, the file offset, and the transfer length must all be multiples of the device block size, or the call fails with EINVAL. And direct I/O still needs fsync() for full durability — bypassing the page cache is not the same as flushing the drive’s own write cache.
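
The alignment rule can be checked before a direct I/O call is issued. The following is a minimal sketch, assuming a 4096-byte block size; BLOCK, align_up, buffer_address, and is_direct_io_safe are hypothetical names introduced for illustration, and the real block size should be taken from the device rather than hard-coded.

```python
import ctypes
import mmap

BLOCK = 4096  # assumed block size; real devices may report 512 or another value


def align_up(n: int, block: int = BLOCK) -> int:
    """Round n up to the next multiple of block."""
    return (n + block - 1) // block * block


def buffer_address(buf) -> int:
    """Return the memory address of a writable buffer such as an mmap."""
    return ctypes.addressof(ctypes.c_char.from_buffer(buf))


def is_direct_io_safe(buf, offset: int, length: int, block: int = BLOCK) -> bool:
    """True when buffer address, file offset, and length are all multiples of block."""
    return (
        buffer_address(buf) % block == 0
        and offset % block == 0
        and length % block == 0
    )


payload = b"hello"
size = align_up(len(payload))        # 5 bytes rounds up to one full block
buf = mmap.mmap(-1, size)            # anonymous mmap starts on a page boundary
buf[: len(payload)] = payload        # the rest of the buffer stays zero-filled
```

Here is_direct_io_safe(buf, 0, size) holds because the anonymous mapping is page-aligned, while an odd offset or a 5-byte length would not pass. An ordinary bytes or bytearray object gives no such address guarantee, which is why the examples allocate through mmap.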

Signals

Property	Buffered I/O	Direct I/O (O_DIRECT)
Read speed	Fast on cache hits (served from RAM); read-ahead helps streaming	Always touches disk — no cache speedup
Write latency	Low — `write()` returns once it is in the cache	Higher — waits on the device transfer
Durable on return	No — needs `fsync()`; a crash loses dirty pages	No — still needs `fsync()` for the device cache
Memory use	Page cache holds copies; risk of double-caching	No kernel copy — app owns its buffer pool
Alignment	None — any buffer, offset, length	Buffer, offset, length must be block-aligned
Typical user	General apps, build tools, anything sequential	Databases managing their own cache (for example, InnoDB when configured to use O_DIRECT)

Neither mode is durable the instant a write call returns. The page cache is a speed layer, not a persistence layer — fsync() is what makes data survive a power loss in both modes.
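
The rule "not durable until fsync() succeeds" can be wrapped in a small helper. This is a sketch rather than a definitive implementation: write_durably is a hypothetical name, and it relies on Python raising OSError when os.write() or os.fsync() fails, which plays the role of checking the return value in C.

```python
import os
import tempfile


def write_durably(path: str, data: bytes) -> None:
    """Write data to path and return only after fsync() has succeeded."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:                       # os.write may write fewer bytes than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                      # raises OSError if the flush fails
    finally:
        os.close(fd)


# Usage: write into a temporary directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "data.bin")
    write_durably(target, b"committed record\n")
    with open(target, "rb") as f:
        round_trip = f.read()
```

The helper covers the file's contents only. For a newly created file, the directory entry may need its own fsync() on a descriptor for the parent directory, which this sketch leaves out.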

Watch out for

Buffered write() is not durable until fsync(). It returns once the bytes are in the page cache. A power loss before write-back drops every dirty page. If the data must survive a crash, call fsync() (or open with O_SYNC) and check its return value.
O_DIRECT needs alignment. The buffer address, file offset, and length must each be a multiple of the device block size (often 512 B or 4096 B). Pass an unaligned buffer and the call fails with EINVAL — allocate with posix_memalign or a page-aligned mmap.
Double caching wastes RAM. If your application keeps its own cache and reads buffered, every hot page lives twice — once in your cache, once in the kernel’s. That is the main reason a database opts into O_DIRECT.
The cache-miss latency cliff. Buffered reads take microseconds on a hit and can take milliseconds on a miss that has to go to a spinning disk. A workload that fits in cache looks fast until it grows past RAM, then falls off a cliff. Size for the miss path, not the hit path.
Benchmarks lie when the cache is warm. Assuming read() always hits the disk is wrong — a re-read may be served entirely from RAM. Drop caches (or use O_DIRECT) before measuring true disk throughput, or you will benchmark memory, not storage.
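
The warm-cache effect from the last two points can be observed with a rough timing sketch. It assumes a Linux-like system where os.posix_fadvise() exists; timed_read and evict_from_page_cache are hypothetical helper names, and POSIX_FADV_DONTNEED is only advice, so the kernel may keep some pages and the "cold" number is not guaranteed to reflect the disk.

```python
import os
import tempfile
import time


def timed_read(path: str) -> tuple[bytes, float]:
    """Read the whole file and return (data, elapsed seconds)."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        data = f.read()
    return data, time.perf_counter() - start


def evict_from_page_cache(path: str) -> bool:
    """Ask the kernel to drop this file's cached pages; False if unsupported."""
    if not hasattr(os, "posix_fadvise"):
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)                                  # dirty pages cannot be dropped
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        return True
    except OSError:
        return False
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")
    with open(path, "wb") as f:
        f.write(os.urandom(4 * 1024 * 1024))
        f.flush()
        os.fsync(f.fileno())
    evicted = evict_from_page_cache(path)
    cold, cold_s = timed_read(path)     # may touch the disk if eviction worked
    warm, warm_s = timed_read(path)     # likely served from the page cache
    print(f"evicted={evicted} cold={cold_s:.6f}s warm={warm_s:.6f}s")
```

The two reads return identical bytes; only the timings differ, and by how much depends on the device, the filesystem, and whether the eviction hint was honored. A memory-backed temporary directory will show little or no gap.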

Worked example

A database stores its tables in data files and runs its own buffer pool — a carefully tuned in-process cache of the hottest pages. If it read those files buffered, every hot page would sit in RAM twice: once in the buffer pool, once in the kernel page cache. So it opens the data files with O_DIRECT and lets its own cache be the single source of truth.

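import mmap
import os

PAGE = 4096                               # assumed page size, a multiple of the device block size

def read_page(fd, page_no):               # illustrative helper, not a real database API
    buf = mmap.mmap(-1, PAGE)             # anonymous mmap gives a page-aligned buffer
    os.preadv(fd, [buf], page_no * PAGE)  # DMA disk -> aligned buffer, no double-cache
    return buf

def write_page(fd, page_no, buf):         # illustrative helper, not a real database API
    os.pwrite(fd, buf, page_no * PAGE)    # DMA aligned buffer -> disk

# data files: direct, so the kernel does not shadow the buffer pool
# (assumes table.dat already exists and holds at least one full page)
data_fd = os.open("table.dat", os.O_RDWR | os.O_DIRECT)

page = read_page(data_fd, 0)
page[0:4] = b"EDIT"                       # modify the page in place
write_page(data_fd, 0, page)

# durability is still explicit: O_DIRECT skips the cache, not the commit
os.fsync(data_fd)                         # flush the device so the page survives a crash
os.close(data_fd)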

Two ideas live together here: O_DIRECT avoids the wasted second copy, and fsync() still guarantees durability. Bypassing the page cache buys predictable, un-double-cached I/O; it does not, on its own, make a write safe across a power loss.
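
To make the "own buffer pool" idea concrete, here is a deliberately small, read-only sketch of one. BufferPool, open_data_file, and PAGE are hypothetical names; the LRU policy is an assumption chosen for brevity, not how any particular database works; and the code falls back to buffered I/O where O_DIRECT is unavailable. A real pool would also track dirty pages and write them back before eviction.

```python
import mmap
import os
import tempfile
from collections import OrderedDict

PAGE = 4096  # assumed page size, a multiple of the device block size


def open_data_file(path: str) -> int:
    """Open with O_DIRECT where the platform and filesystem allow it, else buffered."""
    try:
        return os.open(path, os.O_RDWR | getattr(os, "O_DIRECT", 0))
    except OSError:
        return os.open(path, os.O_RDWR)      # some filesystems reject O_DIRECT


class BufferPool:
    """A tiny LRU cache of pages: the only in-process copy of each hot page."""

    def __init__(self, fd: int, capacity: int) -> None:
        self.fd = fd
        self.capacity = capacity
        self.pages = OrderedDict()           # page_no -> page-aligned mmap buffer
        self.disk_reads = 0

    def get(self, page_no: int) -> mmap.mmap:
        if page_no in self.pages:
            self.pages.move_to_end(page_no)  # hit: mark as most recently used
            return self.pages[page_no]
        buf = mmap.mmap(-1, PAGE)            # page-aligned, as O_DIRECT requires
        os.preadv(self.fd, [buf], page_no * PAGE)
        self.disk_reads += 1
        self.pages[page_no] = buf
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)   # evict the least recently used page
        return buf


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "table.dat")
    with open(path, "wb") as f:
        for n in range(3):                   # three pages, each filled with its own number
            f.write(bytes([n]) * PAGE)
        f.flush()
        os.fsync(f.fileno())

    fd = open_data_file(path)
    pool = BufferPool(fd, capacity=2)
    first = pool.get(1)[0]                   # miss: one read from the file
    again = pool.get(1)[0]                   # hit: served from the pool
    reads_after_two_gets = pool.disk_reads
    os.close(fd)
```

The second get() never reaches the file, so the pool decides what stays in memory. With O_DIRECT in effect the kernel keeps no shadow copy of those pages.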

Check yourself

A buffered write() just returned successfully. The machine loses power one millisecond later, before any fsync(). Is the data safe on disk?

Why does a database open its data files with O_DIRECT instead of reading them buffered?