On-callHardoc-g419

Subject Vacuum bloatLevel Senior–Staff~40 minCommon in Databases & SQL interviewsIndustries Technology

Question

Your primary Postgres (16) has been degrading all afternoon. Write latency on a hot `orders` table has crept from 4ms to 90ms p99, the table's on-disk size grew 40% since morning, and `pg_stat_user_tables` shows `n_dead_tup` climbing past 12M with `last_autovacuum` stuck at 09:14 this morning. Autovacuum workers are running (you see them in `pg_stat_activity`) but never finishing on this table. A nightly analytics export job started at 09:10 and is still `active`, holding a transaction open. Nothing was deployed today. Walk through how you triage this and what you do to stop the bleeding versus fix it durably.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.