Code Room
On-call · Medium
Question
A ticketing service has a known traffic pattern: every weekday at exactly 12:00, a popular daily on-sale opens and traffic spikes 5x within 60 seconds. Each time, you get a 3-4 minute window of elevated p99 latency and some 503s, then it settles. Dashboards show CPU saturating right at 12:00, the HPA scaling up but new pods becoming ready only ~2-3 minutes later (image pull + app warmup), and the load balancer returning 503s while pods are unready. There is no bug and no bad deploy; the pattern repeats identically at every on-sale. Triage and propose durable mitigations.
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
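One durable mitigation that fits this pattern is scaling ahead of the spike rather than reacting to it, since the spike time is known and pods need ~2-3 minutes to become ready. The sketch below is a rough sizing aid, not a definitive implementation. It applies the replica formula documented for the Kubernetes HPA, `ceil(currentReplicas * currentMetric / targetMetric)`, to the projected 5x load, and works out when pre-scaling would have to begin. The baseline replica count, the utilization figures, and the safety margin are hypothetical values chosen for illustration; only the 5x multiplier and the ~3 minute worst-case readiness time come from the question.

```python
import math
from datetime import datetime, timedelta


def replicas_needed(baseline_replicas: int,
                    baseline_utilization: float,
                    target_utilization: float,
                    spike_multiplier: float) -> int:
    """Estimate replicas required to hold target utilization during the spike.

    Uses the HPA-style ratio: ceil(current * projected_metric / target_metric).
    Assumes load spreads evenly across pods and scales linearly with traffic.
    """
    projected_utilization = baseline_utilization * spike_multiplier
    return math.ceil(baseline_replicas * projected_utilization / target_utilization)


def prescale_start(on_sale: datetime,
                   worst_case_ready: timedelta,
                   safety_margin: timedelta) -> datetime:
    """Latest time to start scaling so pods are ready before the on-sale."""
    return on_sale - worst_case_ready - safety_margin


# Hypothetical baseline: 10 pods at 50% CPU, HPA target of 70% CPU.
replicas = replicas_needed(
    baseline_replicas=10,
    baseline_utilization=0.5,
    target_utilization=0.7,
    spike_multiplier=5.0,  # from the question: traffic spikes 5x
)

start = prescale_start(
    on_sale=datetime(2024, 1, 8, 12, 0),    # an arbitrary weekday at 12:00
    worst_case_ready=timedelta(minutes=3),  # from the question: ~2-3 min to ready
    safety_margin=timedelta(minutes=2),     # hypothetical buffer
)

print(replicas)  # 36
print(start)     # 2024-01-08 11:55:00
```

In practice the resulting number could be used as a temporarily raised replica floor for the on-sale window, with the normal floor restored afterwards. Cutting the readiness time itself (pre-pulled images, faster warmup) shrinks the lead time this calculation needs.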