Code Room
On-callEasy
Question
You're on call for a single web API. Ten minutes ago the error-rate dashboard jumped from ~0.1% to ~40% of requests returning HTTP 500, and latency on the success path is unchanged. The deploy timeline shows a new version went out 12 minutes ago; nothing else changed (no traffic spike, no dependency alerts). The on-call channel is filling up with customer reports. How do you triage and mitigate?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.