On-callHardoc-g516

Subject Traffic surgeLevel Senior–Staff~30 minCommon in Reliability & on-call interviewsIndustries Technology

Question

A mobile-app release this week added an aggressive client-side retry to a flaky-feeling endpoint. Today a genuine 2x organic surge hits and your backend, which has handled larger real surges before, falls over: it's pinned at 100% CPU, the load balancer shows incoming request rate ~7x baseline (far above the 2x of real users), and a chunk of that is duplicate requests carrying the same idempotency key seconds apart. Pulling the app open-rate, actual user activity is only up ~2x. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.