Reading a huge file with mmap

Hand the whole file to the kernel as addresses, and pay for disk reads only when you actually touch a page.

The idea

mmap maps a large file straight into your process's virtual address space. Nothing is read from disk up front — the page-table entries for that range start out empty, so the file "exists" as addresses long before any bytes are in memory.

The first time your program touches a page that is not mapped yet, the CPU raises a page fault. The kernel steps in, reads that page from disk into RAM (typically 4 KB on x86-64, often with a few neighboring pages as readahead), fills in the page-table entry, and your read resumes as if nothing happened. So you pay I/O lazily: per page, only for what you actually touch. Better still, when memory gets tight the OS can quietly evict a clean (unmodified) page without your help, because it can always re-read it from the file later.


How it works

You map the file once, then index into it like an array. Reads of untouched regions trigger a fault the kernel handles transparently; you never call read() per page yourself.

import mmap

with open("huge.log", "rb") as f:
    # Map the whole file read-only (length 0 means "the entire file").
    # No file data is read yet; the kernel only reserves the address range.
    # access=ACCESS_READ works on Unix and Windows, unlike prot=PROT_READ.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # First touch of this slice raises a page fault: the kernel reads
        # the page(s) covering [1_000_000_000 : +64] from disk into RAM
        # (possibly with some readahead), then the read continues. Pages
        # we never reference are not read on our behalf.
        chunk = mm[1_000_000_000 : 1_000_000_000 + 64]

        idx = mm.find(b"ERROR")   # scans, faulting pages in lazily as it goes
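
To make "the page(s) covering the slice" concrete, here is a minimal sketch that computes which page indices a byte range overlaps. The helper name pages_covering is hypothetical, and the 4096-byte page size is an assumption (mmap.PAGESIZE reports the real value on your system). It gives the minimum the kernel must bring in; readahead may load more.

```python
import mmap


def pages_covering(offset: int, length: int, page_size: int = mmap.PAGESIZE) -> range:
    """Return the page indices overlapped by the byte range [offset, offset + length)."""
    if length <= 0:
        return range(0)
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return range(first, last + 1)


if __name__ == "__main__":
    # The 64-byte slice from the example above, assuming 4 KB pages.
    pages = pages_covering(1_000_000_000, 64, page_size=4096)
    print(len(pages), pages.start)  # prints: 1 244140
```

With 4 KB pages the slice sits entirely inside page 244140, so a single page is the least that has to be resident for the read to complete.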

Cost / signals

Aspect | What you see
First touch of a page | A stall: a page fault, then a disk read of that page (typically 4 KB, plus any readahead) before the access completes.
Warm re-read | Page is already mapped and resident in RAM: no fault, near memory speed.
Minor vs major fault | Minor: the page is already in the page cache, so the kernel only wires up the mapping. Major: the kernel must actually read from disk.
Memory pressure | OS evicts clean pages with no writeback (it can re-read them); touching them again re-faults.
Signal to watch | Major-fault count (ps -o maj_flt, or the majflt field in /proc/<pid>/stat), or a page-fault rate climbing in your metrics.
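
The fault counters above can also be read from inside the process. The sketch below is one way to do it on Unix, using the standard resource module; the helper names fault_counts and faults_while_touching are hypothetical. It touches one byte per page of a mapped file and reports how many minor and major faults the process took meanwhile. The numbers depend on what is already in the page cache, so treat them as a rough signal rather than an exact measurement.

```python
import mmap
import os
import resource
import tempfile


def fault_counts() -> tuple:
    """Return (minor, major) page-fault totals for this process so far."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt, usage.ru_majflt


def faults_while_touching(path: str) -> tuple:
    """Touch one byte per page of the file; return (minor, major) fault deltas."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        min_before, maj_before = fault_counts()
        checksum = 0
        for offset in range(0, len(mm), mmap.PAGESIZE):
            checksum += mm[offset]  # first touch of each page may fault
        min_after, maj_after = fault_counts()
    return min_after - min_before, maj_after - maj_before


if __name__ == "__main__":
    # Demo on a small scratch file; a freshly written file is usually still
    # in the page cache, so expect mostly minor faults here.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * (mmap.PAGESIZE * 64))
    try:
        minor, major = faults_while_touching(tmp.name)
        print(f"minor faults: {minor}, major faults: {major}")
    finally:
        os.unlink(tmp.name)
```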

Watch out for

Worked example

Say you grep for one rare string in a 10 GB log via mmap on a 16 GB box. The map succeeds almost instantly with zero file I/O. As the scan sweeps forward, pages fault in as they are touched (the kernel usually notices the sequential pattern and reads ahead); the OS keeps recently read pages cached and evicts older clean ones if memory fills. Only the regions you actually scanned are ever read from disk, and if a match sits near the end, the pages before it were read once and could be dropped as soon as memory was needed. That is roughly the same I/O as a streaming read, but with simple array-style code.
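
The sequential scan could be sketched as follows. The function name find_in_file is hypothetical. The madvise call is an optional extra that the text above does not mention: it hints to the kernel that access will be sequential, and it is guarded because mmap.madvise needs Python 3.8+ and MADV_SEQUENTIAL is not defined on every platform.

```python
import mmap


def find_in_file(path: str, needle: bytes) -> int:
    """Return the byte offset of the first match in the file, or -1 if absent."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Optional hint (an assumption, not required for correctness):
        # tell the kernel we intend to read the mapping front to back.
        advice = getattr(mmap, "MADV_SEQUENTIAL", None)
        if advice is not None and hasattr(mm, "madvise"):
            mm.madvise(advice)
        # find() walks the mapping in order, so pages fault in lazily
        # and the scan stops at the first match.
        return mm.find(needle)
```

Calling find_in_file("huge.log", b"ERROR") would stop faulting pages in as soon as the first match is found, so a match near the start of the file costs very little I/O.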

Now read random 8-byte offsets from a 100 GB file on that same 16 GB box. Every offset likely lands on a cold page, so almost every access is a major fault: a full disk seek for 8 useful bytes, and the page you just paid for gets evicted before you reach it again. Throughput collapses to disk-seek speed — the classic mmap thrashing trap.
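
A back-of-the-envelope sketch of why this pattern hurts, using the numbers from the paragraph above. Both helper names are hypothetical, the page size is assumed to be 4 KB, the 8-byte read is assumed not to straddle a page boundary, and the cache estimate optimistically pretends all 16 GB of RAM is available as page cache with uniformly random offsets.

```python
GIB = 2**30


def read_amplification(useful_bytes: int, page_size: int = 4096) -> float:
    """Bytes fetched from disk per useful byte when each access faults in one page."""
    return page_size / useful_bytes


def max_cached_fraction(file_bytes: int, ram_bytes: int) -> float:
    """Upper bound on the share of the file that can be resident in RAM at once."""
    return min(1.0, ram_bytes / file_bytes)


if __name__ == "__main__":
    print(read_amplification(8))                     # prints: 512.0
    print(max_cached_fraction(100 * GIB, 16 * GIB))  # prints: 0.16
```

Under these assumptions each access fetches 512 times more data than it uses, and at best 16% of random offsets land on a page that is already resident, so the large majority of accesses still go to disk.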

Check yourself

1. You mmap a 10 GB file, then read just 64 bytes from the middle. How much is read from disk?

2. On a 16 GB machine, which access pattern over a 100 GB mmap'd file is most likely to thrash?