Question
Postgres 15 primary. Over the last 18 hours, write latency on a single high-churn table (`events`) has crept from 4 ms to 70 ms, and the table's on-disk size has grown 3x even though the row count is flat. `pg_stat_user_tables` shows `n_dead_tup` at 240M and climbing, `last_autovacuum` is NULL for that table, and `autovacuum_count` = 0 since yesterday. Other tables vacuum fine. `pg_stat_activity` shows an `idle in transaction` connection from an analytics job, opened 19 hours ago and still holding a transaction. CPU and IO are not saturated. Walk through how you would triage and mitigate, then how you would prevent recurrence.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
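Applied to this incident, the outline above might be sketched as follows. This is a minimal sketch, not a definitive runbook. The SQL uses only standard Postgres 15 views, functions, and settings, but the helper names (`Session`, `find_stale_idle_transactions`), the one-hour threshold, and the `analytics_job` role name are assumptions made for illustration. The Python part only picks termination candidates out of rows already fetched from `pg_stat_activity`; it does not connect to a database.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

# Triage (read-only): who holds the oldest transaction, and what is the
# vacuum state of the table? reloptions reveals a per-table
# autovacuum_enabled=false override if one exists.
OLDEST_TRANSACTIONS_SQL = """
SELECT pid, usename, application_name, state, xact_start, backend_xmin
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;
"""

TABLE_VACUUM_STATE_SQL = """
SELECT s.n_dead_tup, s.last_autovacuum, s.autovacuum_count, c.reloptions
FROM pg_stat_user_tables AS s
JOIN pg_class AS c ON c.oid = s.relid
WHERE s.relname = 'events';
"""

# Is a vacuum running on the table right now but not finishing?
VACUUM_PROGRESS_SQL = """
SELECT pid, phase, heap_blks_scanned, heap_blks_total
FROM pg_stat_progress_vacuum
WHERE relid = 'events'::regclass;
"""

# Mitigation: end the stale session, then vacuum manually.
TERMINATE_SQL = "SELECT pg_terminate_backend(%(pid)s);"
MANUAL_VACUUM_SQL = "VACUUM (VERBOSE) events;"

# Prevention: 'analytics_job' is a hypothetical role name; '5min' is an
# assumed value to tune per workload.
PREVENT_SQL = (
    "ALTER ROLE analytics_job "
    "SET idle_in_transaction_session_timeout = '5min';"
)


@dataclass(frozen=True)
class Session:
    """The subset of pg_stat_activity columns this sketch needs."""
    pid: int
    state: str
    xact_start: Optional[datetime]


def find_stale_idle_transactions(
    sessions: List[Session],
    now: datetime,
    max_age: timedelta = timedelta(hours=1),  # assumed threshold
) -> List[Session]:
    """Return idle-in-transaction sessions older than max_age, oldest first."""
    stale = [
        s for s in sessions
        # startswith also matches 'idle in transaction (aborted)'
        if s.state.startswith("idle in transaction")
        and s.xact_start is not None
        and now - s.xact_start > max_age
    ]
    return sorted(stale, key=lambda s: s.xact_start)
```

Given the rows from `OLDEST_TRANSACTIONS_SQL`, the helper would return the 19-hour-old analytics session as the candidate for `pg_terminate_backend`. Two caveats are worth stating when communicating status. First, an old open transaction explains why dead tuples cannot be removed, but it does not by itself explain `autovacuum_count` = 0, since autovacuum normally still runs and completes in that situation; the `reloptions` and `pg_stat_progress_vacuum` checks are there to separate those two causes. Second, a plain `VACUUM` makes the dead space reusable but generally does not shrink the file on disk; giving back the 3x growth would need `VACUUM FULL` (which takes an `ACCESS EXCLUSIVE` lock) or an online rewrite tool, scheduled as a follow-up rather than done mid-incident.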