On-callHardoc-g421

Subject Query plan regressionLevel Senior–Staff~35 minCommon in Databases & SQL interviewsIndustries Technology

Question

At 02:40 a key API endpoint's p99 jumped from 120ms to 6s and the DB's CPU is pinned. No code or schema deploy went out. Looking at `pg_stat_statements`, one query — a join between `orders` and `customers` filtered by a recent `created_at` window — went from a few ms to seconds and now dominates total time. An autovacuum/auto-analyze ran on `orders` around 02:35. `EXPLAIN ANALYZE` shows the planner now picking a nested loop with a sequential scan on `customers`, estimating 1 row but actually touching millions. Yesterday the same query used a hash join. Triage and mitigate.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.