On-call · Medium · oc-g531

Subject: OOM kill · Level: Entry–Mid · ~20 min · Common in Reliability & on-call interviews · Industries: Technology, Software development

Question

A service running on Kubernetes is in a restart loop. The pod dashboard shows containers being killed with reason 'OOMKilled' every 20–30 minutes, and memory usage climbs steadily from startup until it hits the container limit and the pod is killed and restarted. Request latency spikes each time a pod cycles. A new version shipped two days ago. CPU and request volume are normal. How do you triage and stop the churn?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
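One way to make "form hypotheses from real signals" concrete is to fit a line to the memory samples from a single pod lifetime and ask how long the container has before it hits its limit. The sketch below is illustrative only: the function names, the sample values, and the 700 MiB limit are assumptions made up for the example, not figures from the scenario. In practice the samples would come from whatever metrics system backs the pod dashboard.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (minutes since container start, memory in MiB)


def leak_rate_mib_per_min(samples: Sequence[Sample]) -> float:
    """Least-squares slope of memory (MiB) against time (minutes)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return cov / var_t


def minutes_until_limit(current_mib: float, limit_mib: float,
                        rate_mib_per_min: float) -> Optional[float]:
    """Minutes until memory reaches the limit, or None if it is not growing."""
    if rate_mib_per_min <= 0:
        return None
    return max(0.0, (limit_mib - current_mib) / rate_mib_per_min)


if __name__ == "__main__":
    # Hypothetical samples from one pod lifetime (assumed values).
    samples = [(0.0, 200.0), (5.0, 300.0), (10.0, 400.0), (15.0, 500.0)]
    rate = leak_rate_mib_per_min(samples)
    current = samples[-1][1]
    print("growth rate (MiB/min):", rate)
    print("minutes left at a 700 MiB limit:", minutes_until_limit(current, 700.0, rate))
    print("minutes left at a 1400 MiB limit:", minutes_until_limit(current, 1400.0, rate))
```

With these made-up numbers the slope is a steady 20 MiB per minute, so doubling the limit only moves the kill from 10 minutes away to 45 minutes away. That is the distinction the answer should draw: raising the limit treats the symptom and buys time, while steady growth under normal CPU and request volume points at a leak, most plausibly introduced by the version that shipped two days ago, so a rollback is the mitigation more likely to stop the churn.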

