Code Room
On-callMedium
Question
Tickets for a major concert go on sale at exactly 10:00. At 10:00:00 traffic jumps 50x instantly. Your ALB starts returning 503s, the autoscaler is adding instances but they take ~4 minutes to boot and pass health checks, and during those 4 minutes the small initial fleet is overwhelmed and crashes. The queue page itself is dynamic and hits the same backend as checkout. Nothing is broken in code; you simply got a wall of traffic at a known instant. How do you triage in the moment and design for the next on-sale?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.