On-callHardoc-g614

Subject Query plan regressionLevel Senior–Staff~40 minCommon in Databases & SQL interviewsIndustries Technology

Question

No deploy, no traffic change — yet at 09:15 one hot endpoint (product search by category + price range) goes from 5ms to 1.2s, and the DB's overall CPU and I/O jump sharply. The slow-query log lights up with that one query. `EXPLAIN ANALYZE` now shows a sequential scan over a 40M-row table where it used to use an index. Stats: the table had a large nightly bulk-load of new SKUs at 02:00, autovacuum/ANALYZE hasn't run since, and the planner's row estimate for the predicate is off by 100x (estimates 50 rows, actually returns 2M then filters). Nothing else changed. Triage and fix.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.