On-call | Medium | oc-g316

Subject: Traffic surge | Level: Mid–Senior | ~25 min | Common in Reliability & on-call interviews | Industries: Technology

Question

Your food-delivery API gets a predictable dinner-rush ramp from 18:00. Every evening at 18:05 you get a 4–6 minute window of elevated latency and 503s before things settle. The autoscaler does eventually add capacity, but new VM-based instances take ~3–4 minutes to boot, pull the image, and pass health checks. Dashboards show: requests ramping steeply from 18:00, the desired-instance count rising at 18:03, but the ready-instance count lagging several minutes behind, with the gap exactly matching the bad window. Each evening the same shape repeats. How do you triage and eliminate the nightly window?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
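For the "what prevents a repeat" part, one option (not the only one) is to request capacity on a schedule ahead of the ramp rather than reacting to it. The sketch below is a minimal illustration of the timing arithmetic, not a real autoscaler API. It assumes the worst-case boot time of 4 minutes from the question, and a safety margin of 2 minutes that is purely hypothetical; the function names are invented for this example.

```python
from datetime import datetime, timedelta


def ready_time(requested_at: datetime, boot_minutes: float) -> datetime:
    """When an instance requested at `requested_at` can first serve traffic."""
    return requested_at + timedelta(minutes=boot_minutes)


def scheduled_scale_out_time(
    ramp_start: datetime, boot_minutes: float, margin_minutes: float
) -> datetime:
    """Latest time to request capacity so it is ready before the ramp starts."""
    return ramp_start - timedelta(minutes=boot_minutes + margin_minutes)


# Arbitrary date; only the clock times matter for the illustration.
day = datetime(2024, 1, 1)
ramp_start = day.replace(hour=18, minute=0)   # dinner rush begins (from the question)
reactive_request = day.replace(hour=18, minute=3)  # desired count rises (from the question)

BOOT_MINUTES = 4    # worst case of the ~3-4 min boot time in the question
MARGIN_MINUTES = 2  # hypothetical safety margin, an assumption

# Reactive scaling: capacity requested at 18:03 is not ready until after boot.
reactive_ready = ready_time(reactive_request, BOOT_MINUTES)

# Scheduled scaling: request early enough that capacity is ready before 18:00.
scheduled_request = scheduled_scale_out_time(ramp_start, BOOT_MINUTES, MARGIN_MINUTES)

print("reactive capacity ready at:", reactive_ready.strftime("%H:%M"))     # 18:07
print("schedule scale-out no later than:", scheduled_request.strftime("%H:%M"))  # 17:54
```

The point of the arithmetic is that reactive capacity always arrives one boot time after demand is detected, so for a predictable ramp the request has to move earlier by at least the boot time. Shortening the boot itself (for example with pre-baked images or a warm pool) attacks the same lag from the other side.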

