Code Room
On-call · Hard
Question
A mobile-app release this week added an aggressive client-side retry to a flaky-feeling endpoint. Today a genuine 2x organic surge hits and your backend, which has handled larger real surges before, falls over. It is pinned at 100% CPU, the load balancer shows an incoming request rate of ~7x baseline (far above the 2x from real users), and a chunk of that traffic is duplicate requests carrying the same idempotency key seconds apart. The app open rate confirms that actual user activity is up only ~2x. How do you triage and mitigate?
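The figures in the prompt can be turned into a rough amplification estimate. The sketch below is only an illustration: the function name is hypothetical, and it assumes that baseline traffic had negligible retries and that each unit of user activity normally produces one request.

```python
def retry_amplification(request_multiplier: float, user_multiplier: float) -> float:
    """Requests generated per unit of real user activity, relative to baseline."""
    if user_multiplier <= 0:
        raise ValueError("user_multiplier must be positive")
    return request_multiplier / user_multiplier


# Numbers taken from the prompt: ~7x request rate, ~2x real user activity.
amplification = retry_amplification(7.0, 2.0)

# Share of current traffic that is retry-generated rather than organic,
# under the assumption that organic traffic scales 1:1 with user activity.
excess_share = 1 - 2.0 / 7.0

print(f"amplification per user action: {amplification:.1f}x")
print(f"share of traffic from retries: {excess_share:.0%}")
```

Under those assumptions each user action is producing about 3.5 requests, and roughly 71% of the load is retries. That points at the retry behaviour, not the organic surge, as the thing to suppress first.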
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
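One possible "stop the bleeding" step, given that the duplicates carry the same idempotency key, is to answer repeats from a short-lived cache instead of redoing the work. The following is a minimal single-process sketch, not a definitive implementation: the class name, the 30-second default window, and the string result type are all assumptions, and a real deployment would more likely use a shared store and would also need to handle duplicates that arrive while the first request is still in flight.

```python
import time
from typing import Callable, Dict, Tuple


class IdempotencyDeduper:
    """Serve repeated idempotency keys from a TTL cache (hypothetical helper)."""

    def __init__(self, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (time the result was stored, cached result)
        self._seen: Dict[str, Tuple[float, str]] = {}

    def handle(self, key: str, do_work: Callable[[], str]) -> Tuple[str, bool]:
        """Return (result, was_duplicate). Runs do_work only for unseen keys."""
        now = self._clock()

        # Drop entries older than the TTL so the cache stays bounded in time.
        expired = [k for k, (ts, _) in self._seen.items() if now - ts >= self._ttl]
        for k in expired:
            del self._seen[k]

        cached = self._seen.get(key)
        if cached is not None:
            # Duplicate within the window: skip the expensive work.
            return cached[1], True

        result = do_work()
        self._seen[key] = (now, result)
        return result, False


if __name__ == "__main__":
    deduper = IdempotencyDeduper(ttl_seconds=30.0)
    first = deduper.handle("order-123", lambda: "created")
    retry = deduper.handle("order-123", lambda: "created")
    print(first, retry)
```

This only removes the cost of the duplicated work. It does not reduce the inbound request rate, so it would normally sit alongside rate limiting or load shedding at the edge and a fix or remote disable of the client retry policy.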