Reading a huge file with mmap

Hand the whole file to the kernel as addresses, and pay for disk reads only when you actually touch a page.

The idea

mmap maps a large file straight into your process's virtual address space. Nothing is read from disk up front — the page-table entries for that range start out empty, so the file "exists" as addresses long before any bytes are in memory.

The first time your program touches a page, the CPU raises a page fault. The kernel steps in, reads that page from disk into RAM (typically 4 KB, and the kernel may read a few neighboring pages ahead as well), fills in the page-table entry, and your read resumes as if nothing happened. So you pay I/O lazily: per page, only for what you actually touch. Better still, when memory gets tight the OS can quietly evict a clean (unmodified) page without your help, because it can always re-read it from the file later.

How it works

You map the file once, then index into it like an array. Reads of untouched regions trigger a fault the kernel handles transparently; you never call read() per page yourself.

import mmap

with open("huge.log", "rb") as f:
    # Map the whole file read-only (length 0 means "the entire file").
    # No file data is read yet; the kernel only reserves the address range.
    # access=ACCESS_READ works on Unix and Windows; prot=PROT_READ is Unix-only.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # First touch of this slice raises a page fault: the kernel reads
        # the page(s) covering [1_000_000_000 : +64] from disk into RAM
        # (typically 4 KB each, possibly with some readahead), then the
        # read continues. Pages we never reference are not read on our behalf.
        chunk = mm[1_000_000_000 : 1_000_000_000 + 64]

        # Scans forward, faulting pages in lazily; returns -1 if not found.
        idx = mm.find(b"ERROR")
    # Leaving the inner "with" block unmaps the file; no explicit close() needed.
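
To see which pages a given slice actually touches, the arithmetic can be sketched as below. This is a minimal illustration, not part of the original example: `pages_covering` is a hypothetical helper name, and the 4096-byte page size passed in is an assumption (the real value on your machine is `mmap.PAGESIZE`).

```python
import mmap


def pages_covering(start: int, length: int, page_size: int = mmap.PAGESIZE) -> range:
    """Return the page indices overlapped by the byte range [start, start + length)."""
    if length <= 0:
        return range(0)
    first = start // page_size
    last = (start + length - 1) // page_size
    return range(first, last + 1)


# Assuming 4096-byte pages, the 64-byte slice at offset 1_000_000_000
# starts 2560 bytes into a page, so it fits inside a single page.
slice_pages = list(pages_covering(1_000_000_000, 64, page_size=4096))

# A 64-byte read that starts 6 bytes before a page boundary straddles two pages,
# so its first touch can cost two faults instead of one.
straddling_pages = list(pages_covering(4090, 64, page_size=4096))

print(slice_pages, straddling_pages)
```

So a small read usually costs one page, occasionally two, regardless of how large the mapped file is.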

Cost / signals

Aspect	What you see
First touch of a page	A stall: a page fault, then (unless the page is already in the page cache) a disk read of at least one page, typically 4 KB, before the access completes.
Warm re-read	Page is already resident in RAM — no fault, near memory speed.
Minor vs major fault	Minor: page already in the cache, just wire up the mapping. Major: must actually hit the disk.
Memory pressure	OS evicts clean pages with no writeback (it can re-read them); touching them again re-faults.
Signal to watch	Major-fault count (`ps -o maj_flt -p <pid>`, or the `majflt` field of `/proc/<pid>/stat`) or page-fault rate climbing in your metrics.
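
The fault counters in the table can also be read from inside the process. The sketch below assumes a Unix-like system (Python's `resource` module is not available on Windows) and uses a small scratch file as a stand-in for real data; `fault_counts` is a hypothetical helper name. Because the scratch file was just written, its pages are probably still in the page cache, so you would expect mostly minor faults here, but the exact numbers depend on the kernel and are not guaranteed.

```python
import mmap
import os
import resource
import tempfile


def fault_counts():
    """Return (minor, major) page-fault counts for this process so far."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt, usage.ru_majflt


# Build a 64-page scratch file (hypothetical stand-in for a real data file).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * (64 * mmap.PAGESIZE))
    path = tmp.name

try:
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        before = fault_counts()
        # Touch one byte in every page so each page has to be mapped in.
        total = sum(mm[i] for i in range(0, len(mm), mmap.PAGESIZE))
        after = fault_counts()
finally:
    os.unlink(path)

minor_delta = after[0] - before[0]
major_delta = after[1] - before[1]
print(f"minor faults: +{minor_delta}, major faults: +{major_delta}")
```

Sampling these counters before and after a suspect code path is a cheap way to tell "slow because of major faults" apart from "slow for some other reason".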

Watch out for

Random access on a file larger than RAM turns into constant major faults — the working set won't fit, so pages keep getting evicted and re-read. That is thrashing, and it can be slower than a plain sequential read().
mmap doesn't free you from durability concerns. For writable mappings, dirty pages aren't on disk until the kernel flushes them, so you still need msync() / fsync() before you can claim the write is durable.
"Preloading" by touching every page defeats the whole point. Walking the entire file just to "warm" it forces every page to fault in, throwing away the laziness you mapped for. If you want to give the kernel hints instead, use madvise() with MADV_WILLNEED or MADV_SEQUENTIAL.
SIGBUS if the file is truncated under you. If another process shrinks the file, touching a now-out-of-range page raises SIGBUS, not a clean end-of-file — your process can crash mid-access.
Address-space limits on 32-bit. A 4 GB virtual address space can't map a 10 GB file at all; you're forced into windowed mmap or plain reads.
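
Two of the points above, windowed mapping and madvise hints, can be combined in one sketch. This is an illustration under stated assumptions, not a definitive implementation: `find_in_windows` and its `window` parameter are hypothetical names, the needle is assumed to be non-empty, and `madvise` is only called where Python exposes it (3.8+ on platforms that support it). Each window's offset must be a multiple of `mmap.ALLOCATIONGRANULARITY`, and windows overlap by `len(needle) - 1` bytes so a match straddling a boundary is not missed.

```python
import mmap
import os
import tempfile


def find_in_windows(path: str, needle: bytes, window: int = 64 * 1024 * 1024) -> int:
    """Return the absolute offset of the first match, or -1, mapping one window at a time."""
    if not needle:
        raise ValueError("needle must be non-empty")
    gran = mmap.ALLOCATIONGRANULARITY
    window = max(gran, (window // gran) * gran)  # mapping offsets must be multiples of gran
    overlap = len(needle) - 1                    # lets a window see matches that straddle its end
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        offset = 0
        while offset < size:
            length = min(window + overlap, size - offset)
            with mmap.mmap(f.fileno(), length, access=mmap.ACCESS_READ, offset=offset) as mm:
                if hasattr(mm, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
                    # A hint only: tells the kernel we will read forward.
                    mm.madvise(mmap.MADV_SEQUENTIAL)
                hit = mm.find(needle)
                if hit != -1:
                    return offset + hit
            offset += window
    return -1


# Demo on a scratch file: the needle deliberately straddles the first window boundary.
gran = mmap.ALLOCATIONGRANULARITY
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"a" * (gran - 2) + b"ERROR" + b"b" * (2 * gran))
    demo_path = tmp.name
try:
    found_at = find_in_windows(demo_path, b"ERROR", window=gran)
    missing = find_in_windows(demo_path, b"MISSING", window=gran)
finally:
    os.unlink(demo_path)
print(found_at, missing)
```

Only one window is mapped at any moment, so the address space needed is bounded by the window size rather than the file size.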

Worked example

Say you grep for one rare string in a 10 GB log via mmap on a 16 GB box. The map succeeds instantly with zero I/O. As the scan sweeps forward, pages fault in 4 KB at a time; the OS keeps recently-read pages cached and evicts the oldest clean ones when the cache fills. Only the regions you actually scanned ever leave the disk, and if a match sits near the end, the pages before it were read exactly once and then dropped — roughly the same I/O as a streaming read, but with simple array-style code.

Now read random 8-byte offsets from a 100 GB file on that same 16 GB box. Every offset likely lands on a cold page, so almost every access is a major fault: a full disk seek for 8 useful bytes, and the page you just paid for gets evicted before you reach it again. Throughput collapses to disk-seek speed — the classic mmap thrashing trap.
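
The numbers behind the second scenario can be worked out directly. The sketch below assumes 4096-byte pages, uniformly random offsets, and that all 16 GB of RAM is available as page cache (which is optimistic, so the hit rate shown is an upper bound). The variable names are illustrative only.

```python
PAGE_SIZE = 4096            # assumed page size in bytes
RAM_BYTES = 16 * 2**30      # 16 GB box
FILE_BYTES = 100 * 2**30    # 100 GB file
READ_BYTES = 8              # useful bytes per access

# Each cold access pulls in at least one whole page to deliver 8 bytes.
read_amplification = PAGE_SIZE // READ_BYTES

# With uniformly random offsets, the chance a page is already resident
# cannot exceed the fraction of the file that fits in RAM.
max_hit_rate = RAM_BYTES / FILE_BYTES
min_major_fault_rate = 1 - max_hit_rate

print(f"read amplification: {read_amplification}x")
print(f"best-case hit rate: {max_hit_rate:.0%}")
print(f"major faults on at least {min_major_fault_rate:.0%} of accesses")
```

Under these assumptions, at least 84% of accesses go to disk, and each of those moves 512 times more data than the program asked for.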

Check yourself

1. You mmap a 10 GB file, then read just 64 bytes from the middle. How much is read from disk?

2. On a 16 GB machine, which access pattern over a 100 GB mmap'd file is most likely to thrash?