Question
At 02:40 a key API endpoint's p99 jumped from 120ms to 6s and the DB's CPU is pinned. No code or schema deploy went out. Looking at `pg_stat_statements`, one query — a join between `orders` and `customers` filtered by a recent `created_at` window — went from a few ms to seconds and now dominates total time. An autovacuum/auto-analyze ran on `orders` around 02:35. `EXPLAIN ANALYZE` shows the planner now picking a nested loop with a sequential scan on `customers`, estimating 1 row but actually touching millions. Yesterday the same query used a hash join. Triage and mitigate.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.