On-callHardoc-g508

Subject Quota exhaustionLevel Senior–Staff~30 minCommon in Networking & APIs interviewsIndustries Technology

Question

Your fleet scales out fine on normal days, but at your two highest-traffic peaks each week, scale-up stalls: new instances launch but take 4-8 minutes to become ready instead of 60 seconds, and a fraction never come up. Instance dashboards are clean (no vCPU/IP quota errors). Digging into instance boot logs you find the bootstrap is retrying calls to STS/IAM (`AssumeRole`) and the secrets manager with throttling errors, but only at peak. Your app's runtime traffic doesn't touch those APIs. How do you triage and mitigate the peak-only scale-up failure?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.