The storage stack

Every read and write travels down a ladder of layers — each one slower but more durable than the last. The whole craft of storage is keeping the data you reach for most in the fast layers near the top.

The idea

Think of how you keep things at home. The notebook open on your desk is grabbed in a heartbeat. The drawer beside you takes a moment. The filing cabinet in the closet takes a short walk. The safe-deposit box at the bank takes a whole trip — but it's the one place nothing ever gets lost.

A storage system is built the same way, as a stack of four layers: client → cache → storage engine → disk. Each step down is slower but larger and more durable. Reads try the cache first: a hit returns in nanoseconds, while a miss falls all the way through to disk and pays milliseconds. Writes land in a durable journal first so a crash can't lose them, then update the engine and refresh the cache.

Because a miss can be a thousand times slower than a hit on SSD, and tens of thousands of times slower on HDD, the entire game is keeping the hot working set in the layers near the top.

How it works

A read tries the cache, and only pays for the slow layers on a miss — then it populates the cache so the next read is a hit. A write logs to a durable journal first so a crash can never lose it, then updates the engine and refreshes the cache.

def read(key):
    val = cache.get(key)
    if val is not None:
        return val                 # cache hit  (~100 ns)
    val = engine.lookup(key)       # miss -> storage engine + disk (~ms)
    cache.put(key, val)            # populate so next read is a hit
    return val

def write(key, value):
    wal.append(key, value)         # durable journal first (crash-safe)
    engine.apply(key, value)       # update the index / pages
    cache.put(key, value)          # keep cache coherent
    return Ack()
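
To run the two functions above end to end, the sketch below fills in the names they assume. The plain-dict `cache`, the list-backed `wal`, and `InMemoryEngine` are hypothetical stand-ins, not a real storage engine; they exist only to make the control flow observable. The one deliberate difference is that this version skips caching a key the engine does not have.

```python
# Hypothetical in-memory stand-ins for the layers named above.
cache = {}          # fast, volatile layer
wal = []            # stands in for the durable journal (a real one lives on disk)


class InMemoryEngine:
    """Toy storage engine: a dict plus a counter of slow-path lookups."""

    def __init__(self):
        self.rows = {}
        self.lookups = 0

    def lookup(self, key):
        self.lookups += 1              # each call models a trip to disk
        return self.rows.get(key)

    def apply(self, key, value):
        self.rows[key] = value


engine = InMemoryEngine()


def read(key):
    val = cache.get(key)
    if val is not None:
        return val                     # cache hit
    val = engine.lookup(key)           # miss: fall through to the engine
    if val is not None:
        cache[key] = val               # populate so the next read is a hit
    return val


def write(key, value):
    wal.append((key, value))           # journal first
    engine.apply(key, value)           # then update the engine
    cache[key] = value                 # then refresh the cache


engine.rows["user:42"] = "old bio"     # data already on "disk", cache cold
first = read("user:42")                # miss, goes to the engine
second = read("user:42")               # hit, engine is not consulted
write("user:42", "new bio")
third = read("user:42")                # hit, and it sees the new value
print(first, second, third, engine.lookups)   # old bio old bio new bio 1
```

The lookup counter stays at 1 across three reads, which is the whole point of the read path: only the first read pays for the slow layer.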

Cost

Latency grows by orders of magnitude as you descend. Faster layers are small and volatile (they vanish on a restart); slower layers are large and durable.

Layer	Typical latency	Typical size	Survives a crash?
Cache (in-process memory; Redis adds a network hop)	`~100 ns`	GBs	No (volatile)
Storage engine (buffer pool)	`~1 µs`	GBs	No (volatile)
SSD	`~100 µs`	TBs	Yes (durable)
HDD	`~5–10 ms`	TBs	Yes (durable)

A read served from cache is roughly 50,000× faster than one that falls through to an HDD (5 ms ÷ 100 ns), and about 1,000× faster than one that falls through to an SSD. That gap is why hit rate, not raw disk speed, usually decides your p99.
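
The arithmetic behind that claim can be checked with a small model. This is a sketch under a simplifying assumption: every read costs either exactly one hit (100 ns) or exactly one HDD miss (5 ms, the low end of the table's range), with nothing in between. The function names are illustrative.

```python
HIT_NS = 100            # cache hit, from the table above
MISS_NS = 5_000_000     # HDD miss at 5 ms, the low end of the table's range


def mean_latency_ns(hit_rate):
    """Average latency when a fraction `hit_rate` of reads hit the cache."""
    return hit_rate * HIT_NS + (1 - hit_rate) * MISS_NS


def p99_latency_ns(hit_rate):
    """p99 in the two-point model: a hit only if at least 99% of reads hit."""
    return HIT_NS if hit_rate >= 0.99 else MISS_NS


for rate in (0.90, 0.99, 0.999):
    print(f"{rate:.1%} hits: mean {mean_latency_ns(rate) / 1000:.1f} µs, "
          f"p99 {p99_latency_ns(rate) / 1000:.1f} µs")
```

Under these assumptions the mean at a 90% hit rate is about 500 µs, dominated almost entirely by the 10% of reads that miss, and p99 stays at the full miss cost until the hit rate reaches 99%.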

Watch out for

Cache invalidation. A write that updates the engine but forgets to refresh or invalidate the cache leaves stale data sitting in the fast layer. Reads then quietly return the old value.
Read-after-write consistency. If a write hasn't propagated to the cache yet, a user who just saved a change can read back the old one. Update or invalidate on write so people see their own edits.
Cold cache after restart. A fresh process starts with an empty cache, so a burst of traffic stampedes straight to disk — a thundering herd. Warm the cache, or coalesce duplicate misses.
Write-back without a journal. A write-back cache that acknowledges before the data is durable can lose recent writes on a crash. The WAL exists precisely so the durable copy is written first.
Assuming uniform latency. A miss is ~1,000× a hit on SSD and ~50,000× on HDD, so a small dip in hit rate can dominate tail latency. Reason about the distribution, not the average.
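
One of the remedies named above, coalescing duplicate misses, can be sketched as follows. `CoalescingLoader` and `slow_load` are hypothetical names for illustration; the idea is that the first caller to miss on a key performs the slow load while concurrent callers for the same key wait for its result instead of each going to disk.

```python
import threading
import time


class CoalescingLoader:
    """Lets one caller load a missing key while concurrent callers wait for it."""

    def __init__(self, load):
        self._load = load              # slow function, e.g. a disk read
        self._lock = threading.Lock()
        self._cache = {}
        self._inflight = {}            # key -> Event set when the load finishes

    def get(self, key):
        with self._lock:
            if key in self._cache:
                return self._cache[key]          # hit
            done = self._inflight.get(key)
            leader = done is None
            if leader:
                done = threading.Event()
                self._inflight[key] = done       # this caller does the load
        if leader:
            try:
                value = self._load(key)
                with self._lock:
                    self._cache[key] = value
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()                       # wake the waiters
            return value
        done.wait()
        return self.get(key)                     # now a hit, or a retry if the load failed


calls = []


def slow_load(key):
    calls.append(key)                  # record every trip to the slow layer
    time.sleep(0.05)                   # stand-in for a disk read
    return f"value-for-{key}"


loader = CoalescingLoader(slow_load)
results = []
threads = [
    threading.Thread(target=lambda: results.append(loader.get("user:42")))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(calls), len(results))        # 1 8
```

Eight concurrent readers produce a single slow load. If the load raises, the leader sees the exception and a waiter becomes the next leader and retries, which is a design choice of this sketch rather than the only reasonable behavior.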

Worked example

A profile page asks for user 42. The very first load misses the cache and pays a ~5 ms disk read; the value is copied into the cache on the way back up.

The next 10,000 loads of that same profile are cache hits at ~100 ns each, about 1 ms in total, which is less than the single miss cost. One slow read bought ten thousand fast ones.

Then the user edits their bio. The write appends to the WAL first (so a crash mid-write can't lose it), updates the engine, and refreshes the cache with the new value — so the very next read is both fresh and fast. That single discipline — journal, then apply, then cache — is what keeps the layers honest.
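
The "journal first" step is what makes recovery possible, and a minimal version fits in a few lines. The sketch below assumes a JSON-lines file as the journal format and uses hypothetical helper names (`wal_append`, `wal_replay`); production journals typically add checksums, binary records, and log truncation, none of which are shown here.

```python
import json
import os
import tempfile


def wal_append(path, key, value):
    """Append one record and force it to stable storage before returning."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"key": key, "value": value}) + "\n")
        f.flush()                 # push Python's buffer to the OS
        os.fsync(f.fileno())      # ask the OS to push it to the device


def wal_replay(path):
    """Rebuild engine state by re-applying every complete record in order."""
    state = {}
    if not os.path.exists(path):
        return state
    with open(path, encoding="utf-8") as f:
        for line in f:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                break             # torn final record from a crash mid-append
            state[record["key"]] = record["value"]
    return state


wal_path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal_append(wal_path, "user:42:bio", "old bio")
wal_append(wal_path, "user:42:bio", "new bio")

# Simulate a crash that left half a record at the end of the file.
with open(wal_path, "a", encoding="utf-8") as f:
    f.write('{"key": "user:42:bio", "val')

recovered = wal_replay(wal_path)
print(recovered)                  # {'user:42:bio': 'new bio'}
```

Replay recovers the last fully written value and ignores the torn record, so a write that was acknowledged after `wal_append` returned is not lost even though the engine and cache were wiped.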

Check yourself

A cache miss on this stack falls through to disk and pays ~5 ms, while a hit returns in ~100 ns. Your service runs at a 90% hit rate, and someone proposes buying faster disks to cut tail latency. What actually moves p99 the most?