Question
A Postgres 13 cluster has started logging `WARNING: database "prod" must be vacuumed within 12000000 transactions`. The app is still up, but you're told it may stop accepting writes soon. `SELECT datname, age(datfrozenxid) FROM pg_database` shows the main DB at 2.05B and rising. Several aggressive autovacuum workers are running, but they keep getting cancelled: the logs show messages along the lines of `canceling autovacuum of table X to prevent deadlock` around a nightly batch that runs `ALTER TABLE` and bulk loads. Describe your triage, what happens if you do nothing, the emergency mitigation, and the prevention.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
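
To put a number on "what happens if you do nothing", here is a rough sketch of the headroom arithmetic. It assumes the count in the warning is measured to the wraparound point, and that Postgres 13 refuses new XID-consuming commands once fewer than 1,000,000 transactions remain (newer releases use a larger margin). The function names and the 500 XIDs/second burn rate are hypothetical; measure the real rate on the cluster before quoting a deadline.

```python
# Sketch only: estimates how long until the cluster stops accepting writes.
# Assumption: Postgres 13 stops assigning new XIDs when fewer than
# 1,000,000 transactions remain before wraparound.
STOP_MARGIN_XIDS = 1_000_000


def xids_before_write_stop(remaining_in_warning: int,
                           stop_margin: int = STOP_MARGIN_XIDS) -> int:
    """XIDs that can still be consumed before writes are refused.

    remaining_in_warning is the number printed in the
    'must be vacuumed within N transactions' warning.
    """
    return max(remaining_in_warning - stop_margin, 0)


def hours_until_write_stop(remaining_in_warning: int,
                           xids_per_second: float,
                           stop_margin: int = STOP_MARGIN_XIDS) -> float:
    """Hours left at a constant XID burn rate (hypothetical input)."""
    if xids_per_second <= 0:
        raise ValueError("xids_per_second must be positive")
    budget = xids_before_write_stop(remaining_in_warning, stop_margin)
    return budget / xids_per_second / 3600


# Number taken from the warning in the question; the rate is made up.
print(xids_before_write_stop(12_000_000))                    # 11000000
print(round(hours_until_write_stop(12_000_000, 500.0), 1))   # 6.1
```

The estimate is only as good as the burn rate: a nightly bulk load can consume XIDs far faster than daytime traffic, so treat the result as an upper bound on the time available when status updates go out.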