Code Room
On-call | Hard | oc-g512
Subject: Quota exhaustion
Level: Senior–Staff
Time: ~30 min
Common in: Reliability & on-call interviews
Industries: Technology

Question

You run the same service in three regions behind latency-based DNS. During an outage at a peer CDN, traffic shifts and us-east-1 takes a 3x surge while eu-west-1 and ap-southeast-1 stay flat. us-east-1 fails to scale: new instances launch, but some of them never go into service, and the ASG log shows a service-quota error. The other two regions are healthy and well under the same nominal limits. Your global capacity dashboard, which aggregates across regions, shows plenty of headroom. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
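One signal worth checking early is headroom per region rather than in aggregate, since an aggregated dashboard can look healthy while a single region has hit its own limit. The sketch below illustrates that arithmetic only. It assumes quotas are enforced per region, and every number, the `RegionCapacity` type, and both helper functions are hypothetical and not taken from any real dashboard or cloud API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionCapacity:
    """Hypothetical per-region capacity record (units could be vCPUs or instances)."""
    region: str
    in_use: int  # units currently consumed against the quota
    quota: int   # assumed per-region limit, in the same units

    @property
    def headroom(self) -> int:
        return self.quota - self.in_use


def global_headroom_ratio(regions: list[RegionCapacity]) -> float:
    """Fraction of total quota still free when all regions are summed together."""
    total_quota = sum(r.quota for r in regions)
    total_used = sum(r.in_use for r in regions)
    return (total_quota - total_used) / total_quota


def regions_short_of(regions: list[RegionCapacity], extra_demand: dict[str, int]) -> list[str]:
    """Regions whose own headroom cannot absorb the additional demand placed on them."""
    return [r.region for r in regions if extra_demand.get(r.region, 0) > r.headroom]


# Illustrative numbers only: each region has the same nominal quota.
regions = [
    RegionCapacity("us-east-1", in_use=400, quota=1000),
    RegionCapacity("eu-west-1", in_use=300, quota=1000),
    RegionCapacity("ap-southeast-1", in_use=200, quota=1000),
]

# A 3x surge in us-east-1 means 1200 units needed, i.e. 800 more than today.
extra_demand = {"us-east-1": 800}

print(f"global headroom: {global_headroom_ratio(regions):.0%}")
print("regions that cannot absorb the surge:", regions_short_of(regions, extra_demand))
```

With these made-up figures the summed view still reports ample free capacity, while the per-region check flags us-east-1 because the surge exceeds that region's own remaining quota. The same comparison would apply to any per-region limit that the scaling path depends on.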
