Question
Over the past week a once-fast endpoint backed by a high-churn `sessions` table (constant INSERT/UPDATE/DELETE) has degraded: queries that read 1,000 rows now take 10x longer and the table+index on-disk size has grown 5x even though the live row count is flat at ~2M. I/O on the DB is up and the buffer cache hit ratio dropped. `pg_stat_user_tables` shows `n_dead_tup` is huge (40M dead vs 2M live) and `last_autovacuum` is days old. Recent context: a long-running reporting transaction has been kept open most of the week, and someone bumped `autovacuum_vacuum_cost_delay` higher last month to reduce I/O. Triage and resolve.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.