Question
Your food-delivery API sees a predictable dinner-rush ramp starting at 18:00. Every evening, starting at 18:05, there is a 4–6 minute window of elevated latency and 503s before things settle. The autoscaler does eventually add capacity, but new VM-based instances take ~3–4 minutes to boot, pull the image, and pass health checks. Dashboards show requests ramping steeply from 18:00 and the desired-instance count rising at 18:03, while the ready-instance count lags several minutes behind; that gap matches the bad window exactly. The same shape repeats every evening. How do you triage and eliminate the nightly window?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
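One way to ground the "real signals" step is to measure the desired-versus-ready gap directly and to work out how early capacity would need to be requested for it to be ready by the ramp. The sketch below is a minimal illustration under stated assumptions: the function names `capacity_gap_windows` and `prewarm_start` are hypothetical, the sample tuples are made-up numbers shaped like the dashboard described in the question (not real data), and the 5-minute safety margin is an arbitrary choice.

```python
from datetime import datetime, timedelta


def capacity_gap_windows(samples):
    """Return (start, end) spans during which ready < desired.

    samples: list of (timestamp, desired, ready) tuples, sorted by time.
    """
    windows = []
    start = None
    for ts, desired, ready in samples:
        if ready < desired:
            if start is None:
                start = ts  # gap opens at the first under-capacity sample
        elif start is not None:
            windows.append((start, ts))  # gap closes when ready catches up
            start = None
    if start is not None:
        # Gap still open at the end of the data: close it at the last sample.
        windows.append((start, samples[-1][0]))
    return windows


def prewarm_start(ramp_start, boot_time, margin):
    """Latest time to request capacity so it is ready before the ramp."""
    return ramp_start - boot_time - margin


# Illustrative samples only (assumed values): (time, desired, ready)
day = datetime(2024, 1, 1)
samples = [
    (day.replace(hour=18, minute=0), 4, 4),
    (day.replace(hour=18, minute=3), 8, 4),
    (day.replace(hour=18, minute=5), 10, 4),
    (day.replace(hour=18, minute=7), 10, 8),
    (day.replace(hour=18, minute=9), 10, 10),
]

gaps = capacity_gap_windows(samples)
start = prewarm_start(
    ramp_start=day.replace(hour=18, minute=0),
    boot_time=timedelta(minutes=4),  # upper end of the ~3-4 minute boot time
    margin=timedelta(minutes=5),     # assumed safety margin
)
print(gaps)
print(start)
```

If the measured gap lines up with the latency and 503 window night after night, that supports treating boot lag as the root cause and the errors as the symptom. A time-based scale-up scheduled at or before the computed start time is then a natural candidate for preventing the repeat, rather than relying on the reactive autoscaler alone.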