Code Room
On-callMedium
Question
A single service's latency dashboard shows p99 response times climbing from 200ms to several seconds over the last ten minutes, and some requests are now timing out. CPU on the instances is pegged near 100%. The request-rate graph shows traffic roughly tripled in the same window — a link to the product was just shared widely. No deploy happened. How do you triage and keep the service up?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.