Question
No deploy, no traffic change — yet at 09:15 one hot endpoint (product search by category + price range) goes from 5ms to 1.2s, and the DB's overall CPU and I/O jump sharply. The slow-query log lights up with that one query. `EXPLAIN ANALYZE` now shows a sequential scan over a 40M-row table where it used to use an index. Stats: the table had a large nightly bulk-load of new SKUs at 02:00, autovacuum/ANALYZE hasn't run since, and the planner's row estimate for the predicate is off by 100x (estimates 50 rows, actually returns 2M then filters). Nothing else changed. Triage and fix.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.