Code Room
On-call · Medium
Question
A service running on Kubernetes is in a restart loop. The pod dashboard shows containers being killed with reason 'OOMKilled' every 20–30 minutes: memory usage climbs steadily from startup until it hits the container limit, at which point the container is killed and restarted. Request latency spikes each time a pod cycles. A new version shipped two days ago. CPU and request volume are normal. How do you triage and stop the churn?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
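One way to turn "hypotheses from real signals" into something concrete is to check whether memory growth is roughly linear and how long it takes to reach the limit. The sketch below fits a least-squares line to memory samples and estimates the minutes remaining before the container limit is hit. The function name `minutes_until_limit`, the sample values, and the 512 MiB limit are all hypothetical, chosen only for illustration; real samples would come from whatever metrics source you already have.

```python
def minutes_until_limit(samples, limit_mib):
    """Estimate minutes until memory reaches limit_mib.

    samples: list of (minutes_since_start, memory_mib) tuples (hypothetical data).
    Returns None when there is no upward trend or too little data to fit a line.
    """
    if len(samples) < 2:
        return None

    xs = [t for t, _ in samples]
    ys = [m for _, m in samples]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Ordinary least-squares slope: growth rate in MiB per minute.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None  # all samples share one timestamp; no slope to fit
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx

    if slope <= 0:
        return None  # flat or shrinking memory: not a steady climb

    # Project forward from the most recent sample.
    return (limit_mib - ys[-1]) / slope


# Illustrative samples: memory climbing 12 MiB per minute from startup.
samples = [(0, 200.0), (5, 260.0), (10, 320.0), (15, 380.0)]
eta = minutes_until_limit(samples, limit_mib=512)
print(f"estimated minutes until limit: {eta:.1f}")
```

A steady slope that is independent of request volume points toward a leak or unbounded cache introduced by the new version, which supports rolling back as the mitigation. If raising the limit only delays the kill by the interval this estimate predicts, the limit is not the root cause.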