Question
The API runs in us-east-1, us-west-2, and eu-west-1 behind latency-based DNS. A config change last Tuesday raised the DB connection-pool max from 50 to 200 per pod to handle a planned campaign. The campaign launches today. eu-west-1 is fine. Under load, us-east-1 and us-west-2 both start throwing `FATAL: remaining connection slots are reserved for non-replication superuser connections`; p99 latency spikes, followed by a partial outage. The dashboards show that each region's pod count and per-pod pool size are identical in the deploy manifest, but the Postgres `max_connections` setting differs: eu-west-1's primary was resized to a larger instance class last quarter (`max_connections` 800), while the other two are still on the old class (`max_connections` 200). How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
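
The mismatch in the question comes down to arithmetic. Each pod may open up to its pool max, so worst-case demand per region is pods times pool max, and that has to fit under `max_connections` minus the slots Postgres holds back for superusers. A minimal Python sketch of that check follows. The pod count of 3 is an assumption (the question does not state it), and the helper names are hypothetical. The reserved-slot value of 3 is the PostgreSQL default for `superuser_reserved_connections` and may differ on these instances.

```python
from dataclasses import dataclass, replace

# PostgreSQL default for superuser_reserved_connections; assumed unchanged here.
SUPERUSER_RESERVED = 3

# Hypothetical pod count per region; the question only says it is identical everywhere.
ASSUMED_PODS = 3


@dataclass(frozen=True)
class Region:
    name: str
    pods: int              # assumption, see ASSUMED_PODS
    pool_max_per_pod: int  # from the question: raised from 50 to 200
    max_connections: int   # from the question: 800 in eu-west-1, 200 elsewhere


def worst_case_demand(region: Region) -> int:
    """Connections requested if every pod fills its pool."""
    return region.pods * region.pool_max_per_pod


def usable_slots(region: Region) -> int:
    """Slots available to ordinary (non-superuser) application connections."""
    return region.max_connections - SUPERUSER_RESERVED


def fits(region: Region) -> bool:
    """True when worst-case demand stays within the usable slots."""
    return worst_case_demand(region) <= usable_slots(region)


def safe_pool_max(region: Region) -> int:
    """Largest per-pod pool max that still fits, ignoring any other DB clients."""
    return usable_slots(region) // region.pods


def with_pool_max(region: Region, pool_max: int) -> Region:
    """Same region with a different per-pod pool max, for what-if checks."""
    return replace(region, pool_max_per_pod=pool_max)


REGIONS = [
    Region("us-east-1", ASSUMED_PODS, 200, 200),
    Region("us-west-2", ASSUMED_PODS, 200, 200),
    Region("eu-west-1", ASSUMED_PODS, 200, 800),
]

if __name__ == "__main__":
    for r in REGIONS:
        print(
            f"{r.name}: demand={worst_case_demand(r)} "
            f"usable={usable_slots(r)} fits={fits(r)} "
            f"safe_pool_max={safe_pool_max(r)}"
        )
```

Under these assumed numbers the old pool max of 50 fits in every region, while 200 fits only in eu-west-1. A per-pod cap at or below `safe_pool_max` is one candidate for the immediate rollback in the two US regions, pending the real pod counts and any other clients connected to the same primary.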