On-call · Medium · oc-g042

Subject: Latency spikes
Level: Mid–Senior
~30 min
Common in: Reliability & on-call interviews
Industries: Technology, Software development

Question

A ticketing service has a known traffic pattern: every weekday at exactly 12:00 a popular daily on-sale opens and traffic spikes 5x within 60 seconds. Each time, you see a 3-4 minute window of elevated p99 latency and some 503s before things settle. Dashboards show CPU saturating right at 12:00 and the HPA scaling up, but new pods become ready only ~2-3 minutes later (image pull + app warmup), and the load balancer returns 503s while pods are unready. There is no bug and no bad deploy; the pattern repeats every weekday. Triage and propose durable mitigations.
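
To make the timing in the scenario concrete, here is a rough Python sketch of the gap between demand and ready capacity. Only the 5x factor, the 60-second ramp, and the ~2-3 minute readiness delay come from the question. The baseline load, per-pod throughput, starting pod count, linear ramp shape, and the idea that the HPA reacts instantly at 12:00 are all illustrative assumptions, so the output is a feel for the shape of the problem, not a prediction.

```python
import math
from dataclasses import dataclass


@dataclass
class Scenario:
    baseline_rps: float = 1000.0    # assumption: steady-state load before 12:00
    spike_factor: float = 5.0       # from the question: traffic spikes 5x
    ramp_seconds: int = 60          # from the question: within 60 seconds
    pod_rps: float = 250.0          # assumption: throughput one ready pod can serve
    baseline_pods: int = 5          # assumption: pods ready before the spike
    ready_delay_seconds: int = 150  # from the question: ~2-3 min pull + warmup


def demand(s: Scenario, t: int) -> float:
    """Requests per second t seconds after 12:00, assuming a linear ramp."""
    frac = min(max(t, 0) / s.ramp_seconds, 1.0)
    return s.baseline_rps * (1 + (s.spike_factor - 1) * frac)


def capacity(s: Scenario, t: int) -> float:
    """Ready capacity, assuming the HPA requests all needed pods at t=0
    and they all become ready together after ready_delay_seconds."""
    needed = math.ceil(s.baseline_rps * s.spike_factor / s.pod_rps)
    pods = s.baseline_pods if t < s.ready_delay_seconds else max(needed, s.baseline_pods)
    return pods * s.pod_rps


def overload_window(s: Scenario, horizon: int = 600):
    """Return (first_overloaded_second, first_recovered_second), or None."""
    over = [t for t in range(horizon) if demand(s, t) > capacity(s, t)]
    return (over[0], over[-1] + 1) if over else None


if __name__ == "__main__":
    print("reactive scaling:", overload_window(Scenario()))
    print("pre-warmed to peak:", overload_window(Scenario(baseline_pods=20)))
```

Under these assumptions the overload lasts almost exactly as long as the pod readiness delay, which is consistent with the question's framing that the lag in capacity, not a code defect, drives the daily 503s.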

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
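
One way to illustrate the "what prevents a repeat" part is scheduled pre-scaling: since the spike is predictable, capacity can be made ready before 12:00. The sketch below is one possible sizing calculation, not a prescribed answer. The peak load, per-pod throughput, 70% target utilization, and safety margin are hypothetical numbers, and the function names are made up for this example.

```python
import math
from datetime import datetime, timedelta


def prescale_replicas(peak_rps: float, pod_rps: float, target_utilization: float = 0.7) -> int:
    """Replicas needed to serve peak_rps while keeping each pod at or below
    target_utilization of its (assumed) maximum throughput."""
    return math.ceil(peak_rps / (pod_rps * target_utilization))


def prescale_start(spike_at: datetime, ready_delay: timedelta, safety_margin: timedelta) -> datetime:
    """Latest time to request the extra replicas so they are ready,
    with some margin, before the spike begins."""
    return spike_at - ready_delay - safety_margin


if __name__ == "__main__":
    # Hypothetical inputs: 5,000 rps at peak, 250 rps per pod.
    replicas = prescale_replicas(peak_rps=5000, pod_rps=250)
    # ~3 minute readiness delay from the question, plus an assumed 5 minute margin.
    start = prescale_start(
        spike_at=datetime(2024, 1, 8, 12, 0),
        ready_delay=timedelta(minutes=3),
        safety_margin=timedelta(minutes=5),
    )
    print(f"scale to {replicas} replicas by {start:%H:%M}")
```

In a Kubernetes setup this kind of result might be applied by raising the HPA's `minReplicas` on a schedule shortly before the on-sale and lowering it afterwards. Shortening the readiness delay itself (image pre-pulling, faster warmup) would be a complementary mitigation.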
