On-callMediumoc-g550

Subject On callLevel Mid–Senior~35 minCommon in Reliability & on-call interviewsIndustries Technology

Question

Your latency-sensitive checkout service runs on a shared Kubernetes node pool. At 16:00 checkout p99 degrades from 80ms to 700ms — but only on some pods, and it's intermittent: the same pod is fine for minutes then spikes. No deploy to checkout. Error rate is flat; it's purely latency. Node metrics show CPU on the affected nodes is fine on average but you see frequent CPU throttling events on the checkout containers, and one node is running a newly-scheduled batch analytics job (a Spark executor pod) that has no CPU limits set and is pegging cores in bursts. The checkout pods on nodes WITHOUT the analytics pod are healthy. Walk through your triage and the fix.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.