Code Room
On-callHard
Question
Your fleet scales out fine on normal days, but at your two highest-traffic peaks each week, scale-up stalls: new instances launch but take 4-8 minutes to become ready instead of 60 seconds, and a fraction never come up. Instance dashboards are clean (no vCPU/IP quota errors). Digging into instance boot logs you find the bootstrap is retrying calls to STS/IAM (`AssumeRole`) and the secrets manager with throttling errors, but only at peak. Your app's runtime traffic doesn't touch those APIs. How do you triage and mitigate the peak-only scale-up failure?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.