On-call | Medium | oc-g541

Subject: Traffic spike | Level: Entry–Mid | ~20 min | Common in: Reliability & on-call interviews | Industries: Technology, Software development

Question

A single service's latency dashboard shows p99 response times climbing from 200ms to several seconds over the last ten minutes, and some requests are now timing out. CPU on the instances is pegged near 100%. The request-rate graph shows traffic roughly tripled in the same window — a link to the product was just shared widely. No deploy happened. How do you triage and keep the service up?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
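One possible way to "stop the bleeding" when CPU is saturated is to shed excess load so that admitted requests still finish quickly, while capacity is added in parallel. The following is a minimal sketch of a token-bucket admission check, not a prescribed solution. The names `TokenBucket`, `handle`, and `do_work` are hypothetical, and the rate, burst size, and the choice of a 503 response are assumptions for illustration.

```python
import time


class TokenBucket:
    """Admit roughly `rate` requests per second, with bursts up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = float(rate)      # tokens added per second (assumed value)
        self.burst = float(burst)    # maximum tokens the bucket can hold
        self.clock = clock           # injectable clock, useful for testing
        self.tokens = float(burst)   # start with a full bucket
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def handle(request, bucket, do_work):
    """Reject cheaply when over budget instead of letting requests time out."""
    if not bucket.allow():
        # Hypothetical response shape: status code plus a short message.
        return 503, "overloaded, retry later"
    return 200, do_work(request)
```

A fast rejection costs far less CPU than a request that runs for seconds and then times out, which is why shedding can bring p99 back down for the traffic that is admitted. In an interview answer this would sit alongside scaling out, and the limit would be set from the throughput the instances sustained before the spike.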
