On-callMediumoc-g320

Subject Cold startLevel Mid–Senior~24 minCommon in Concurrency interviewsIndustries Technology

Question

Your image-resize endpoint runs on a serverless platform that scales to zero when idle and scales out one new instance per unit of concurrency. A marketing email goes out at 10:00 and within seconds thousands of users click through. Almost every one of those first requests takes 3–5s instead of the usual 150ms, and a chunk of them time out at the CDN. After ~30 seconds latency returns to normal. Logs show a flood of `init` / `coldstart` markers at 10:00:0x. There was no deploy. How do you triage and prevent the next email from doing this?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.