On-callMediumoc-g022

Subject Query of deathLevel Mid–Senior~30 minCommon in Databases & SQL interviewsIndustries Technology, Software development

Question

At 12:40 your reporting service's Postgres read replica spikes to 100% CPU and stays there; every query on that replica times out. pg_stat_activity shows one query running for 14 minutes that the app never normally runs that long: a user-built ad-hoc report joining three large tables with a LIKE '%...%' free-text filter and no LIMIT, generated by a new self-serve report builder shipped this morning. The same query keeps reappearing because the user keeps hitting 'retry' after each timeout. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.