Code Room
On-callHard
Question
A primary PostgreSQL instance's data volume fills to 100% at 04:00 and the database refuses writes (it shuts down to protect itself). Disk dashboards show steady growth on pg_wal/ over the last 12 hours specifically — table data size is flat. A read replica went offline for maintenance at 16:00 yesterday and never reconnected, and a logical replication slot for a downstream analytics CDC pipeline shows a replication-lag/slot-retained-bytes metric climbing all night. Normal table churn is unchanged. Triage and recover.
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.