Code Room
On-call · Medium
Question
A long-running Docker host (used as a self-hosted CI runner) begins failing to start containers with 'no space left on device' on /var/lib/docker, and `df -h` confirms that the filesystem is full. The host has been up for months and has run thousands of CI jobs. `docker system df` shows large 'RECLAIMABLE' figures: dozens of GB across dangling images, stopped containers, anonymous volumes, and build cache. Image churn rose after the team switched to multi-stage builds last month. There is no application bug. Triage and remediate.
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
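One way to make "mitigate first, then prevent a repeat" concrete is a staged cleanup that prints each command before running it. The sketch below is illustrative, not a prescribed procedure: the `RETENTION` window of 168h, the `APPLY` switch, and the helper names `run` and `cleanup_plan` are assumptions introduced here. It defaults to a dry run so the plan can be reviewed and shared as a status update before anything is deleted.

```shell
#!/bin/sh
# Sketch of a staged cleanup for a full /var/lib/docker.
# Assumptions (not from the scenario): a 7-day retention window and an
# APPLY=1 switch; without APPLY=1 the commands are only printed.

RETENTION="${RETENTION:-168h}"   # hypothetical retention window

run() {
  # Print the command so the plan is visible; execute only when APPLY=1.
  echo "+ $*"
  if [ "${APPLY:-0}" = "1" ]; then
    "$@"
  fi
}

cleanup_plan() {
  # 1. Record the starting point so reclaimed space can be reported.
  run docker system df
  # 2. Stopped containers go first: they pin images and volumes.
  run docker container prune --force --filter "until=${RETENTION}"
  # 3. Build cache, the likely growth source after the multi-stage switch.
  run docker builder prune --force --filter "until=${RETENTION}"
  # 4. Dangling (untagged) images only; tagged images are left alone.
  run docker image prune --force
  # 5. Unused volumes. Which volumes this covers depends on the Docker
  #    version, so check the installed release before applying.
  run docker volume prune --force
  # 6. Confirm the filesystem actually recovered.
  run df -h /var/lib/docker
}

cleanup_plan
```

Running the script as-is prints the six commands; running it with `APPLY=1` executes them in order. The same plan, run on a schedule or replaced by the daemon's build-cache garbage-collection settings, is one candidate for the "what prevents a repeat" step, alongside an alert on /var/lib/docker usage.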