Snapshot bloat and disk pressure

Snapshots are cheap to take but they keep old blocks alive — take enough and never prune, and they quietly eat the disk until writes start failing.

The idea

A copy-on-write snapshot captures a point-in-time view almost for free: it shares every block with the live volume and only diverges as the live data changes. The catch is that each retained snapshot pins the old version of every block the live volume later rewrites, so those old blocks cannot be freed for as long as the snapshot exists.

Take snapshots frequently and never prune them, and disk usage grows with churn × retention rather than with the size of the live data, which can sit flat. When free space falls below a threshold the system hits disk pressure: writes slow, then fail, the database may flip to read-only, and the node may be evicted. The fix is a retention policy: prune or merge old snapshots, and alert on free space and snapshot count, not just live-data size.
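
The alerting advice above can be sketched in a few lines of Python. This is a minimal illustration, not a monitoring recipe: the `Thresholds` values (20% minimum free space, 720 snapshots) and the function names are assumptions made for the example, and the snapshot count is passed in by the caller because how to obtain it depends on the storage system.

```python
import shutil
from dataclasses import dataclass


@dataclass
class Thresholds:
    # Hypothetical limits chosen for illustration; tune them per system.
    min_free_fraction: float = 0.20
    max_snapshots: int = 720


def check_pressure(total_bytes, free_bytes, snapshot_count, limits=Thresholds()):
    """Return a list of alert strings; an empty list means healthy."""
    alerts = []
    free_fraction = free_bytes / total_bytes
    if free_fraction < limits.min_free_fraction:
        alerts.append(f"low free space: {free_fraction:.0%} free")
    if snapshot_count > limits.max_snapshots:
        alerts.append(f"too many snapshots: {snapshot_count}")
    return alerts


def check_path(path, snapshot_count, limits=Thresholds()):
    """Same check, reading real numbers for the filesystem that holds path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return check_pressure(usage.total, usage.free, snapshot_count, limits)
```

Neither check looks at live-data size, which is the point: live data can stay flat while both of these signals degrade.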

How it works

A copy-on-write write never destroys block contents that a snapshot still references. Instead it first copies the old contents aside into a newly allocated block (the snapshots are remapped to point at that copy), and only then lets the live block take the new data. Retention reclaims space by dropping old snapshots, but a pinned block is freed only once no remaining snapshot references it.

// copy-on-write: never clobber a block a snapshot still needs
function cow_write(volume, block_id, new_data):
    if any_snapshot_references(block_id):       // shared with a snapshot
        old = volume.blocks[block_id]
        copy = allocate_new_block()             // pins old version on disk
        copy.data = old.data
        for snap in snapshots_referencing(block_id):
            snap.remap(block_id -> copy)        // snapshot keeps the old view
    volume.blocks[block_id] = write(new_data)   // live volume diverges

// retention: prune old snapshots, free only un-shared blocks
function prune(snapshots, max_age_days):
    for snap in snapshots:
        if snap.age > max_age_days:
            drop(snap)
    for blk in pinned_blocks():
        if not any_snapshot_references(blk):    // nothing left needs it
            free(blk)                           // space finally returns
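
The pseudocode above can be made concrete with a toy in-memory model. This is a sketch under stated assumptions, not how any particular filesystem or volume manager implements snapshots: the `Volume` class, its dictionaries, and the per-snapshot `age_days` field are all invented for illustration, and reference checks scan every snapshot where a real system would keep reference counts.

```python
import itertools


class Volume:
    """Toy copy-on-write volume: logical block ids map to physical blocks."""

    def __init__(self):
        self.store = {}      # physical id -> data
        self.live = {}       # logical id -> physical id (the live view)
        self.snapshots = {}  # name -> (age_days, logical -> physical map)
        self._ids = itertools.count()

    def _alloc(self, data):
        pid = next(self._ids)
        self.store[pid] = data
        return pid

    def _referenced_by_snapshot(self, pid):
        return any(pid in view.values() for _, view in self.snapshots.values())

    def snapshot(self, name, age_days=0):
        # Copies only the block map, never the block data.
        self.snapshots[name] = (age_days, dict(self.live))

    def write(self, block_id, data):
        pid = self.live.get(block_id)
        if pid is None:
            self.live[block_id] = self._alloc(data)  # first write to this block
            return
        if self._referenced_by_snapshot(pid):
            # Copy the old data aside and point the snapshots at the copy.
            copy = self._alloc(self.store[pid])
            for _, view in self.snapshots.values():
                if view.get(block_id) == pid:
                    view[block_id] = copy
        self.store[pid] = data  # the live view diverges here

    def prune(self, max_age_days):
        expired = [n for n, (age, _) in self.snapshots.items() if age > max_age_days]
        for name in expired:
            del self.snapshots[name]
        in_use = set(self.live.values())
        for _, view in self.snapshots.values():
            in_use.update(view.values())
        for pid in [p for p in self.store if p not in in_use]:
            del self.store[pid]  # freed only once nothing references it
```

Writing a block, snapshotting, and writing the block again leaves two physical blocks in `store`. Pruning returns to one only after every snapshot that still maps the old version has been dropped.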

Trade-offs

Aspect | Cost | Signal to watch
Snapshot creation | O(1): just a new reference, no data copied | Snapshot count climbing without bound
Space | Grows with churn × retention, not live-data size | Used space far above live-data size
Performance | Copy-on-write write amplification and fragmentation | Write latency creeping up over time
Deletion | Pruning frees only blocks no remaining snapshot shares | Reclaimed space far below the snapshot’s logical size
Recovery time | Point-in-time restore is fast, but each kept point costs space | Retention depth vs free-space headroom
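
The deletion trade-off can be illustrated with set arithmetic. The sketch below assumes each snapshot is represented as a set of physical block ids. That representation and the function name `reclaimable_blocks` are hypothetical, chosen only to show why a snapshot's logical size overstates what deleting it frees.

```python
def reclaimable_blocks(target, snapshots, live_blocks):
    """Blocks freed by deleting `target`: those nothing else references.

    snapshots maps a snapshot name to the set of physical block ids it
    references; live_blocks is the set the live volume references.
    """
    still_needed = set(live_blocks)
    for name, blocks in snapshots.items():
        if name != target:
            still_needed |= blocks
    return snapshots[target] - still_needed


# Hypothetical layout: neighbouring snapshots share most of their blocks.
snapshots = {"mon": {1, 2, 3}, "tue": {2, 3, 4}, "wed": {3, 4, 5}}
live = {4, 5, 6}
freed = reclaimable_blocks("mon", snapshots, live)
```

In this layout "mon" references three blocks, but only block 1 is unique to it, so deleting "mon" frees one block rather than three.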

Watch out for

Worked example

Take 100 GB of live data with 5% daily churn and an hourly snapshot kept for 30 days. Each day rewrites about 5 GB of blocks; under a snapshot, every rewrite pins the old version, so roughly 5 GB of new pinned space accrues per day on top of the steady 100 GB of live data. Over the 30-day retention window that is about 30 × 5 = 150 GB of snapshot-pinned blocks, for about 250 GB used in total while the live data never leaves 100 GB.

On a 256 GB disk, that crosses the warning watermark within roughly two weeks (about 170 GB used at day 14) and is nearly full by day 30 (about 250 GB). Throughout, the live data stays flat while the snapshot-pinned space climbs day by day, pushing the disk from healthy through warning and into disk pressure, until a prune drops the oldest snapshots and their uniquely pinned blocks are finally freed.
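
The arithmetic in this example can be checked with a small projection. It is a deliberately simplified model: it assumes every day's churn pins the same amount of new space and that pruning caps the pinned window at the retention period. The function names and the `limit_gb` parameter are illustrative assumptions, since the text does not give an exact watermark.

```python
def projected_usage_gb(day, live_gb=100.0, daily_churn=0.05, retention_days=30):
    """Approximate used space after `day` days of snapshotting."""
    pinned_days = min(day, retention_days)  # pruning caps the pinned window
    return live_gb + pinned_days * live_gb * daily_churn


def first_day_at_or_above(limit_gb, max_days=365, **kwargs):
    """First day on which projected usage reaches limit_gb, or None."""
    for day in range(max_days + 1):
        if projected_usage_gb(day, **kwargs) >= limit_gb:
            return day
    return None
```

With the defaults, `projected_usage_gb(14)` is about 170 and `projected_usage_gb(30)` is about 250. Under this model usage then plateaus, because each pruned day offsets each new one, so how close the disk gets to full depends entirely on the retention window.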

Check yourself

Live data is steady at 100 GB, but the disk keeps filling toward full. What is the most likely cause?

You delete the oldest snapshot to recover space, but barely any frees. Why?