On-call · Medium · oc-g327

Subject: Traffic surge
Level: Mid–Senior
Time: ~24 min
Common in: Reliability & on-call interviews
Industries: Technology

Question

A surprise viral moment drives 8x normal traffic to your storefront API. The autoscaler ramps quickly and then flatlines: the fleet is pinned at its configured maximum of 100 instances, every instance is at ~95% CPU, and excess requests queue and time out. The scaling event log shows repeated 'desired exceeds max, capped at 100' messages. This is real, legitimate traffic — no retry storm, no misconfig. How do you triage and keep the store usable through the surge?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
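One way to make "stop the bleeding" concrete when the fleet is already capped is priority-based load shedding: reject the least important requests early so that the most important ones still complete. The sketch below is illustrative only. The route names, the priority tiers, the utilization thresholds, and the `LoadShedder` class are all assumptions made for this example and are not part of the question.

```python
from dataclasses import dataclass, field

# Hypothetical priority tiers for a storefront API (lower number = more important).
PRIORITY = {"checkout": 0, "cart": 1, "browse": 2, "recommendations": 3}

# Hypothetical thresholds: (utilization percent, lowest priority still admitted
# while utilization is below that percent).
THRESHOLDS = [(70, 3), (85, 2), (95, 1)]


@dataclass
class LoadShedder:
    capacity: int                 # max concurrent requests the capped fleet can serve
    in_flight: int = 0            # requests currently admitted and not yet released
    shed_counts: dict = field(default_factory=dict)

    def _lowest_priority_allowed(self) -> int:
        # Integer arithmetic avoids floating-point surprises at the boundaries.
        for percent, allowed in THRESHOLDS:
            if self.in_flight * 100 < self.capacity * percent:
                return allowed
        return 0  # at 95% or above, only top-priority traffic is admitted

    def try_admit(self, route: str) -> bool:
        prio = PRIORITY.get(route, 3)  # unknown routes count as lowest priority
        full = self.in_flight >= self.capacity
        if full or prio > self._lowest_priority_allowed():
            self.shed_counts[route] = self.shed_counts.get(route, 0) + 1
            return False  # caller should fail fast, e.g. with 429 or 503
        self.in_flight += 1
        return True

    def release(self) -> None:
        # Call when an admitted request finishes, whether it succeeded or failed.
        self.in_flight = max(0, self.in_flight - 1)


shedder = LoadShedder(capacity=100)
admitted_browse = sum(shedder.try_admit("browse") for _ in range(200))
admitted_cart = sum(shedder.try_admit("cart") for _ in range(200))
admitted_checkout = sum(shedder.try_admit("checkout") for _ in range(200))
print(admitted_browse, admitted_cart, admitted_checkout, shedder.in_flight)
```

In this toy run, browse traffic is admitted until utilization reaches 85%, cart traffic until 95%, and checkout takes the remaining headroom. Failing the shed requests fast, rather than letting them queue and time out, is the design choice that keeps the admitted requests within their latency budget.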

