On-callHardoc-g417

Subject Disk fullLevel Senior–Staff~40 minCommon in Databases & SQL · Storage & CDN · Distributed systems interviewsIndustries Technology, Software development

Question

A Postgres primary's data volume creeps from 60% to 100% over two days and then refuses writes with 'No space left on device'. pg_wal/ has ballooned, but write traffic and checkpoint frequency are normal and unchanged — WAL generation rate is flat. `pg_replication_slots` shows one logical replication slot whose `restart_lsn` hasn't advanced in ~48 hours; that slot feeds a CDC/analytics consumer that silently died after a deploy two days ago. The primary can't recycle WAL past the stuck slot. Triage and mitigate.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.