On-callMediumoc-g653

Subject Snapshot bloat disk pressureLevel Mid–Senior~30 minCommon in Storage & CDN interviewsIndustries Technology

Question

Your database runs on a copy-on-write filesystem/volume (ZFS/LVM-thin/cloud snapshots) and takes hourly snapshots for point-in-time recovery, retained 30 days. On-call is paged: the data volume is filling fast and projected to hit 100% in ~5 hours, threatening writes. Dashboards: actual live dataset size is roughly flat at ~2 TB, but total volume usage climbed from 4 TB to 9 TB over the past week; snapshot count is normal (720), but per-snapshot 'unique referenced' bytes are huge and growing; a data-retention/GC job that rewrites or deletes large swaths of old rows ran heavily this week; nothing was actually deleted from disk. How do you triage snapshot bloat / space amplification that's about to fill the volume?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.