Code Room
On-call · Medium
Question
At 12:40, your reporting service's Postgres read replica spikes to 100% CPU and stays there; every query on that replica times out. pg_stat_activity shows one query that has been running for 14 minutes, far longer than anything the app normally runs: a user-built ad hoc report that joins three large tables with a LIKE '%...%' free-text filter and no LIMIT, generated by a new self-serve report builder shipped this morning. The same query keeps reappearing because the user hits 'retry' after each timeout. How do you triage and mitigate?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
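As a rough illustration of the "mitigate first, then prevent a repeat" order, the sketch below builds the statements an on-call engineer might run. It is one possible approach under stated assumptions, not the reference answer: the role name report_builder, the 5-minute threshold, and the 30-second timeout are hypothetical, and the Activity rows stand in for results fetched from pg_stat_activity. The Postgres pieces it relies on (pg_stat_activity, pg_cancel_backend, statement_timeout, ALTER ROLE ... SET) are standard.

```python
from dataclasses import dataclass
from datetime import timedelta

# Triage: list active statements on the replica running longer than 5 minutes.
# The 5-minute cutoff is an assumed threshold, not something the scenario specifies.
LONG_RUNNING_SQL = """
SELECT pid, usename, now() - query_start AS runtime, left(query, 200) AS query
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()
  AND now() - query_start > interval '5 minutes'
ORDER BY runtime DESC;
"""


@dataclass(frozen=True)
class Activity:
    """A simplified row from pg_stat_activity (only the fields used here)."""
    pid: int
    usename: str
    runtime: timedelta


def pids_to_cancel(rows, role, threshold):
    """Return PIDs of backends owned by `role` that have run longer than `threshold`."""
    return [r.pid for r in rows if r.usename == role and r.runtime > threshold]


def cancel_sql(pid):
    """Mitigation: cancel the running statement but keep the session open."""
    return f"SELECT pg_cancel_backend({int(pid)});"


def timeout_guardrail_sql(role, timeout_ms):
    """Prevention: cap statement runtime for the role (value in milliseconds).

    Run this on the primary, since the replica is read-only; it applies to
    new sessions opened by that role.
    """
    if not role.isidentifier():
        raise ValueError("role must be a plain identifier")
    return f"ALTER ROLE {role} SET statement_timeout = {int(timeout_ms)};"


# Example with made-up rows: one 14-minute report query, one normal app query.
rows = [
    Activity(pid=4242, usename="report_builder", runtime=timedelta(minutes=14)),
    Activity(pid=4300, usename="app", runtime=timedelta(seconds=2)),
]
for pid in pids_to_cancel(rows, "report_builder", timedelta(minutes=5)):
    print(cancel_sql(pid))  # prints: SELECT pg_cancel_backend(4242);
print(timeout_guardrail_sql("report_builder", 30000))
```

Cancelling the backend only buys time here, because the user keeps retrying; the timeout (or temporarily disabling the report builder) is what keeps the replica healthy while the root cause, an unbounded leading-wildcard join, gets fixed.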