Code Room
On-call | Hard | oc-g222
Subject: Vacuum bloat
Level: Senior–Staff, ~45 min
Common in: Reliability & on-call interviews
Industries: Technology, Software development

Question

A Postgres 13 cluster has started logging `WARNING: database "prod" must be vacuumed within 12000000 transactions`. The app is still up, but you are told it may stop accepting writes soon. `SELECT datname, age(datfrozenxid) FROM pg_database` shows the main DB at 2.05B and rising. Several aggressive autovacuum workers are running, but they keep getting cancelled: the logs show messages in the style of `canceling autovacuum of table X to prevent deadlock` around a nightly batch that runs `ALTER TABLE` and bulk loads. Describe your triage, what happens if you do nothing, the emergency mitigation, and the prevention.
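
To reason about "what happens if you do nothing", it can help to turn the number in the warning into a time budget. The sketch below is only illustrative arithmetic. It assumes that Postgres 13 stops assigning new transaction IDs roughly 1,000,000 XIDs before the wraparound point, and the burn rate of 500,000 XIDs per hour is a made-up input that you would replace with a measured value.

```python
# Assumed Postgres 13 safety margin: new XIDs are refused about
# 1,000,000 transactions before the wraparound point.
STOP_MARGIN_XIDS = 1_000_000


def hours_until_write_stop(remaining_xids: int, xids_per_hour: float) -> float:
    """Estimate hours left before the server refuses to assign new XIDs.

    remaining_xids: the N from "must be vacuumed within N transactions".
    xids_per_hour:  observed XID consumption rate (hypothetical input).
    """
    if xids_per_hour <= 0:
        raise ValueError("xids_per_hour must be positive")
    # Only the XIDs above the stop margin are actually usable.
    usable = max(remaining_xids - STOP_MARGIN_XIDS, 0)
    return usable / xids_per_hour


# The warning from the question, with an assumed burn rate of 500k XIDs/hour.
print(hours_until_write_stop(12_000_000, 500_000))  # 22.0
```

Stating a number like this early gives the status updates a concrete deadline and shows whether the nightly batch can run even one more time.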

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
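
One way to make "stop the bleeding" concrete is to pause the nightly batch so it no longer takes conflicting locks, then manually freeze the tables with the oldest `relfrozenxid` first, since those are what hold `datfrozenxid` back. The sketch below is a hedged illustration, not a runbook: the helper `freeze_plan`, the table names, and the ages are hypothetical, and the `VACUUM` options shown (skipping index cleanup and truncation to finish faster) are one possible emergency choice rather than a recommendation for every cluster.

```python
from typing import List, Tuple

# Query a responder might run inside the affected database to obtain
# (table_name, xid_age) rows; it is shown here as a string only.
OLDEST_TABLES_SQL = """
SELECT c.oid::regclass::text AS table_name, age(c.relfrozenxid) AS xid_age
FROM pg_class c
WHERE c.relkind IN ('r', 'm', 't')
ORDER BY age(c.relfrozenxid) DESC
LIMIT 20;
"""

# Emergency-oriented options: freeze tuples, skip index cleanup and truncation.
VACUUM_TEMPLATE = "VACUUM (FREEZE, VERBOSE, INDEX_CLEANUP OFF, TRUNCATE OFF) {table};"


def freeze_plan(rows: List[Tuple[str, int]], limit: int = 5) -> List[str]:
    """Return VACUUM statements for the `limit` oldest tables, oldest first."""
    oldest_first = sorted(rows, key=lambda row: row[1], reverse=True)
    return [VACUUM_TEMPLATE.format(table=name) for name, _age in oldest_first[:limit]]


# Hypothetical rows, shaped like the output of OLDEST_TABLES_SQL.
rows = [
    ("public.events", 2_050_000_000),
    ("public.users", 180_000_000),
    ("public.audit_log", 2_049_500_000),
]
for statement in freeze_plan(rows, limit=2):
    print(statement)
```

With these sample rows the plan targets `public.events` and then `public.audit_log`. After each manual vacuum finishes, re-running the `age(datfrozenxid)` query from the question confirms whether the database-level age is actually dropping.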
