Code Room
On-callMedium
Question
Your image-resize endpoint runs on a serverless platform that scales to zero when idle and scales out one new instance per unit of concurrency. A marketing email goes out at 10:00 and within seconds thousands of users click through. Almost every one of those first requests takes 3–5s instead of the usual 150ms, and a chunk of them time out at the CDN. After ~30 seconds latency returns to normal. Logs show a flood of `init` / `coldstart` markers at 10:00:0x. There was no deploy. How do you triage and prevent the next email from doing this?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.