Hand the whole file to the kernel as addresses, and pay for disk reads only when you actually touch a page.
mmap maps a large file straight into your process's virtual address space. Nothing is read from disk up front — the page-table entries for that range start out empty, so the file "exists" as addresses long before any bytes are in memory.
The first time your program touches a page, the CPU raises a page fault. The kernel steps in, reads that page from disk into RAM (4 KB on most systems, and the kernel may read a few neighbouring pages ahead), fills in the page-table entry, and your read resumes as if nothing happened. So you pay for I/O lazily: per page, only for what you actually touch. Better still, when memory gets tight the OS can quietly evict a clean (unmodified) page without your help, because it can always re-read it from the file later.
You map the file once, then index into it like an array. Reads of untouched regions trigger a fault the kernel handles transparently; you never call read() per page yourself.
```python
import mmap

with open("huge.log", "rb") as f:
    # Map the whole file read-only. No bytes are read yet:
    # the kernel just sets up empty page-table entries.
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)

    # First touch of this slice raises a page fault: the kernel
    # reads the page(s) covering [1_000_000_000 : +64] from disk
    # into RAM (4 KB each on most systems), then the read continues.
    # Pages we never reference are never read from disk at all.
    chunk = mm[1_000_000_000 : 1_000_000_000 + 64]

    idx = mm.find(b"ERROR")  # scans, faulting pages in lazily as it goes
    mm.close()
```
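To make "the page(s) covering a slice" concrete, here is a minimal sketch of the arithmetic. The helper name `pages_touched` is hypothetical, and the 4 KB page size in the examples is an assumption; `mmap.PAGESIZE` reports the real value on your machine.

```python
import mmap


def pages_touched(offset: int, length: int, page_size: int = mmap.PAGESIZE) -> range:
    """Return the indices of the pages that a read of `length` bytes at `offset` touches."""
    if length <= 0:
        return range(0)
    first = offset // page_size                 # page holding the first byte
    last = (offset + length - 1) // page_size   # page holding the last byte
    return range(first, last + 1)


# A 64-byte read in the middle of a page stays inside that one page.
print(len(pages_touched(1_000_000_000, 64, page_size=4096)))  # 1

# The same 64 bytes straddling a page boundary touch two pages.
print(len(pages_touched(4096 - 8, 64, page_size=4096)))  # 2
```

So a small read can cost one or two page faults depending on alignment, never a fault per byte.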
| Aspect | What you see |
|---|---|
| First touch of a page | A stall: a page fault, then a disk read of that page (typically 4 KB) before the access completes. |
| Warm re-read | Page is already resident in RAM — no fault, near memory speed. |
| Minor vs major fault | Minor: page already in the cache, just wire up the mapping. Major: must actually hit the disk. |
| Memory pressure | OS evicts clean pages with no writeback (it can re-read them); touching them again re-faults. |
| Signal to watch | Major-fault count (majflt in ps -o maj_flt, /proc/<pid>/stat) or page-fault rate climbing in your metrics. |
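One way to watch the minor and major fault counts from inside the process is `resource.getrusage`, which exposes them as `ru_minflt` and `ru_majflt`. The sketch below is Unix-only (the `resource` module is not available on Windows), and the helper names `fault_counts` and `faults_during` are hypothetical. The counters are process-wide, so the deltas also include faults caused by anything else the interpreter does in between.

```python
import mmap
import resource


def fault_counts() -> tuple[int, int]:
    """Return (minor, major) page-fault counts for this process so far."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt, usage.ru_majflt


def faults_during(path: str, offset: int, length: int) -> tuple[int, int]:
    """Map `path` read-only, read one slice, and return the (minor, major) fault deltas."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        minor_before, major_before = fault_counts()
        _ = mm[offset : offset + length]  # first touch of the covering page(s)
        minor_after, major_after = fault_counts()
    return minor_after - minor_before, major_after - major_before
```

Calling `faults_during("huge.log", 1_000_000_000, 64)` on a file that is not yet cached would be expected to report at least one major fault. Repeating the call shortly afterwards typically reports only minor faults, because the page is already in the page cache.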
- … read().
- … msync() / fsync() before you can claim the write is durable.
- … madvise(WILLNEED/SEQUENTIAL) if you want hints instead.
- … SIGBUS, not a clean end-of-file; your process can crash mid-access.

Say you grep for one rare string in a 10 GB log via mmap on a 16 GB box. The map succeeds instantly with zero I/O. As the scan sweeps forward, pages fault in 4 KB at a time; the OS keeps recently-read pages cached and evicts the oldest clean ones when the cache fills. Only the regions you actually scanned ever leave the disk, and if a match sits near the end, the pages before it were read exactly once and then dropped: roughly the same I/O as a streaming read, but with simple array-style code.
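The forward scan described above could be sketched as follows. The function name `find_first` is hypothetical. `mmap.madvise` requires Python 3.8 or later and `MADV_SEQUENTIAL` is only defined on platforms that support it, so the hint is applied only when both are present; it is advice to the kernel, not a guarantee of any particular readahead behaviour.

```python
import mmap
import os


def find_first(path: str, needle: bytes) -> int:
    """Return the byte offset of the first `needle` in the file, or -1 if absent."""
    if os.path.getsize(path) == 0:
        return -1  # an empty file cannot be mapped
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Optional hint: we intend to read front to back.
        if hasattr(mm, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
            mm.madvise(mmap.MADV_SEQUENTIAL)
        # find() walks the mapping in order, faulting pages in as it goes.
        return mm.find(needle)
```

The code never issues a `read()` call or manages a buffer itself; the kernel decides which pages are resident at any moment.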
Now read random 8-byte offsets from a 100 GB file on that same 16 GB box. Every offset likely lands on a cold page, so almost every access is a major fault: a full disk seek for 8 useful bytes, and the page you just paid for gets evicted before you reach it again. Throughput collapses to disk-seek speed — the classic mmap thrashing trap.
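A back-of-envelope estimate shows how bad this gets. The sketch below assumes uniformly random offsets, a best case in which all RAM serves as page cache, cache hits that cost nothing, and a hypothetical 10 ms per cold read (a rough figure for a spinning disk; SSDs are far faster). The function name and parameters are illustrative only.

```python
def random_read_throughput(
    file_gb: float, ram_gb: float, seek_ms: float, bytes_per_read: int = 8
) -> float:
    """Estimate useful bytes per second for uniformly random reads over an mmap'd file."""
    hit_rate = min(1.0, ram_gb / file_gb)  # chance the page is already resident
    miss_rate = 1.0 - hit_rate             # chance of a major fault
    if miss_rate == 0.0:
        return float("inf")                # everything fits in RAM; disk is not the limit
    avg_seconds_per_read = miss_rate * seek_ms / 1000.0
    return bytes_per_read / avg_seconds_per_read


estimate = random_read_throughput(file_gb=100, ram_gb=16, seek_ms=10)
print(f"{estimate:.0f} useful bytes per second")
```

Under these assumptions about 84% of reads miss the cache, which puts useful throughput at roughly 950 bytes per second even though each miss pulls in a full page.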
1. You mmap a 10 GB file, then read just 64 bytes from the middle. How much is read from disk?
2. On a 16 GB machine, which access pattern over a 100 GB mmap'd file is most likely to thrash?