Code Room
On-callHard
Question
A Postgres primary's data volume creeps from 60% to 100% over two days and then refuses writes with 'No space left on device'. pg_wal/ has ballooned, but write traffic and checkpoint frequency are normal and unchanged — WAL generation rate is flat. `pg_replication_slots` shows one logical replication slot whose `restart_lsn` hasn't advanced in ~48 hours; that slot feeds a CDC/analytics consumer that silently died after a deploy two days ago. The primary can't recycle WAL past the stuck slot. Triage and mitigate.
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.