Code Room
On-call | Hard | oc-g509
Subject: Capacity incidents | Level: Senior–Staff | ~30 min | Common in Reliability & on-call interviews | Industries: Technology

Question

To absorb your nightly dinner-rush spike you keep a warm pool of 40 pre-booted, pre-warmed instances that the ASG promotes into service in seconds. It has worked for months. Tonight you shipped a routine deploy at 17:30, and then at 18:00 the dinner ramp hit and scale-up was slow again — 4-minute cold boots, 503s, the exact symptom the warm pool was meant to prevent. The warm pool shows 0 available instances. CPU and the app are otherwise healthy. What happened, how do you mitigate right now, and how do you prevent it?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
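As one way to "form hypotheses from real signals", the sketch below summarizes a warm-pool snapshot into ready versus not-ready counts. It is an illustration under stated assumptions, not the expected answer: the input dict is assumed to mirror the shape of boto3's `describe_warm_pool` response (`Instances` entries carrying a `LifecycleState`), the set of "ready" states is a judgment call, and the function name, sample instance IDs, and snapshot are hypothetical. A pool whose instances are all still pending shortly after the 17:30 deploy would be consistent with the deploy having replaced the pool, but that remains a hypothesis to confirm against the ASG activity history.

```python
from collections import Counter

# Lifecycle states this sketch treats as "ready to be promoted" (assumption).
READY_STATES = {"Warmed:Stopped", "Warmed:Running", "Warmed:Hibernated"}


def summarize_warm_pool(response, expected_size):
    """Summarize a describe_warm_pool-shaped dict into ready/not-ready counts."""
    instances = response.get("Instances", [])
    by_state = Counter(i.get("LifecycleState", "unknown") for i in instances)
    ready = sum(n for state, n in by_state.items() if state in READY_STATES)
    return {
        "ready": ready,                              # promotable right now
        "not_ready": len(instances) - ready,         # pending, terminating, unknown
        "shortfall": max(expected_size - ready, 0),  # gap against the intended pool size
        "by_state": dict(by_state),
    }


# Hypothetical snapshot: the pool is being rebuilt and nothing is promotable yet.
sample = {
    "Instances": [
        {"InstanceId": "i-aaa", "LifecycleState": "Warmed:Pending"},
        {"InstanceId": "i-bbb", "LifecycleState": "Warmed:Pending:Wait"},
    ]
}
summary = summarize_warm_pool(sample, expected_size=40)
```

Keeping the summary a pure function over a dict means it can be exercised against saved snapshots during a postmortem; in a live incident the same dict could come from an actual `describe_warm_pool` call.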

Diagram & narrate the incident