Question
Your primary Postgres (16) has been degrading all afternoon. The p99 write latency on a hot `orders` table has crept from 4 ms to 90 ms, the table's on-disk size has grown 40% since morning, and `pg_stat_user_tables` shows `n_dead_tup` climbing past 12M with `last_autovacuum` stuck at 09:14 this morning. Autovacuum workers are running (you can see them in `pg_stat_activity`) but never finish on this table. A nightly analytics export job started at 09:10 and is still `active`, holding a transaction open. Nothing was deployed today. Walk through how you triage this, and what you do to stop the bleeding versus fix it durably.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
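One way to make the "real signals" step concrete is sketched below, under stated assumptions. The working hypothesis is that the export's open transaction is holding back the xmin horizon, so vacuum cannot remove dead tuples on `orders`. `XMIN_HOLDERS_SQL` uses only standard `pg_stat_activity` columns. The `Session` class, the `oldest_blocker` and `mitigation_plan` helpers, the one-hour threshold, the pids, the application names, and the calendar date are all hypothetical and exist only for illustration. The sketch prints a runbook rather than running anything against the database.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Optional

# Sessions whose snapshot may be holding back the xmin horizon, oldest first.
# Run this by hand (e.g. in psql) and compare xact_start with last_autovacuum.
XMIN_HOLDERS_SQL = """
SELECT pid, usename, application_name, state, xact_start,
       age(backend_xmin) AS xmin_age
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;
"""


@dataclass(frozen=True)
class Session:
    """Hypothetical container for the pg_stat_activity columns we care about."""
    pid: int
    application_name: str
    state: str
    xact_start: datetime


def oldest_blocker(sessions: Iterable[Session], now: datetime,
                   max_age: timedelta = timedelta(hours=1)) -> Optional[Session]:
    """Return the longest-open transaction at least max_age old, or None."""
    candidates = [s for s in sessions if now - s.xact_start >= max_age]
    return min(candidates, key=lambda s: s.xact_start, default=None)


def mitigation_plan(pid: int, table: str) -> List[str]:
    """Ordered statements for a human to run; not safe for untrusted input."""
    return [
        # Gentlest option first: cancel the running query.
        f"SELECT pg_cancel_backend({pid});",
        # Only if the transaction is still open after the cancel.
        f"SELECT pg_terminate_backend({pid});",
        # Once the old snapshot is gone, let vacuum reclaim dead tuples.
        f"VACUUM (VERBOSE, ANALYZE) {table};",
    ]


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 15, 0)  # hypothetical "this afternoon"
    sessions = [
        Session(4242, "analytics_export", "active", datetime(2024, 1, 1, 9, 10)),
        Session(5151, "orders_api", "active", datetime(2024, 1, 1, 14, 59)),
    ]
    blocker = oldest_blocker(sessions, now)
    if blocker is not None:
        print(f"suspect: pid={blocker.pid} app={blocker.application_name}")
        for statement in mitigation_plan(blocker.pid, "orders"):
            print(statement)
```

Ending the export's transaction and vacuuming is the mitigation, not the fix. A plain `VACUUM` makes dead space reusable but generally does not return it to the operating system, so the 40% size growth may persist until the table is rewritten (for example with `VACUUM FULL`, which takes an exclusive lock). Durable options worth evaluating include running the export against a replica (keeping in mind that `hot_standby_feedback` can still hold back cleanup on the primary), setting `statement_timeout` and `idle_in_transaction_session_timeout` for the export role, and alerting on the age of the oldest open transaction.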