On-callHardoc-g221

Subject Vacuum bloatLevel Senior–Staff~40 minCommon in Databases & SQL interviewsIndustries Technology, Software development

Question

Postgres 15 primary. Over the last 18 hours, write latency on a single high-churn table (`events`) has crept from 4 ms to 70 ms and the table's on-disk size grew 3x even though row count is flat. `pg_stat_user_tables` shows `n_dead_tup` at 240M and climbing, `last_autovacuum` is NULL for that table, and `autovacuum_count` = 0 since yesterday. Other tables vacuum fine. `pg_stat_activity` shows a `idle in transaction` connection from an analytics job opened 19 hours ago holding a transaction. CPU and IO are not saturated. Walk through how you triage and mitigate, then prevent recurrence.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.