On-call · Medium · oc-g177

Subject: OOM · Level: Mid–Senior · ~30 min · Common in: Algorithms & data structures interviews · Industries: Software development

Question

A Spring Boot payments service running on Kubernetes started crash-looping at 02:00. Dashboards show pods restarting every ~4 minutes with exit code 137; container memory climbs steadily from 600 MB to the 1Gi limit, then drops to zero at each restart. JVM heap-used (from Micrometer) only reaches ~480 MB before each kill, well under the -Xmx setting of 768m. GC pause time and old-gen occupancy look normal. There was no deploy, and traffic is flat versus last week. The only recent change was an infra ticket that lowered the pod memory limit from 1.5Gi to 1Gi the previous afternoon. Walk through how you triage and mitigate this.
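
One way to ground a first hypothesis is to do the arithmetic the dashboards imply. The sketch below is an illustration, not part of the original question. It decodes exit code 137 using the common 128 + signal-number convention, then compares the memory left outside the Java heap ceiling before and after the limit change. The function names are hypothetical, and the question's MB and Gi figures are all treated as MiB for a rough estimate.

```python
"""Rough memory-budget arithmetic for the incident in the question."""

# A few Linux signal numbers; only the ones relevant to this scenario.
SIGNALS = {9: "SIGKILL", 15: "SIGTERM"}


def describe_exit_code(code: int) -> str:
    """Exit codes above 128 conventionally mean 'killed by signal (code - 128)'."""
    if code > 128:
        signum = code - 128
        name = SIGNALS.get(signum, "unknown signal")
        return f"killed by signal {signum} ({name})"
    return f"exited with status {code}"


def headroom_outside_heap(limit_mib: int, xmx_mib: int) -> int:
    """Memory left for everything that is not Java heap if the heap grows to -Xmx."""
    return limit_mib - xmx_mib


def memory_not_in_live_heap(container_mib: int, heap_used_mib: int) -> int:
    """Container memory at kill time that heap-used does not account for."""
    return container_mib - heap_used_mib


if __name__ == "__main__":
    xmx = 768          # -Xmx768m
    old_limit = 1536   # 1.5Gi
    new_limit = 1024   # 1Gi

    print(describe_exit_code(137))
    print("headroom before change:", headroom_outside_heap(old_limit, xmx), "MiB")
    print("headroom after change: ", headroom_outside_heap(new_limit, xmx), "MiB")
    # ~480 MB heap-used at kill, approximated as MiB.
    print("not explained by live heap:", memory_not_in_live_heap(new_limit, 480), "MiB")
```

Under these assumptions, the heap ceiling alone leaves only 256 MiB for everything outside the heap (metaspace, thread stacks, direct buffers, code cache, and so on), down from 768 MiB. That is consistent with a container-level kill while heap-used looks healthy. It remains a hypothesis to confirm against the pod's last termination reason, not a conclusion.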

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
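
For the "prevents a repeat" step, one possible guardrail is a check that flags a memory limit too small for the configured heap plus the measured non-heap footprint. This is a minimal sketch under stated assumptions: `limit_is_safe`, the 10% margin, and the 300 MiB non-heap peak are hypothetical placeholders, not values from the incident. A real figure would come from measuring the running service.

```python
def limit_is_safe(limit_mib: float, xmx_mib: float,
                  non_heap_peak_mib: float, margin: float = 0.10) -> bool:
    """Return True if the limit covers max heap plus non-heap peak, with a safety margin.

    All inputs are in MiB. The margin is a fraction (0.10 means 10% extra).
    """
    required = (xmx_mib + non_heap_peak_mib) * (1 + margin)
    return limit_mib >= required


if __name__ == "__main__":
    xmx = 768            # -Xmx768m, from the question
    non_heap_peak = 300  # hypothetical measurement, NOT from the question

    for limit in (1024, 1536):  # 1Gi and 1.5Gi
        verdict = "ok" if limit_is_safe(limit, xmx, non_heap_peak) else "too small"
        print(f"limit {limit} MiB: {verdict}")
```

With these placeholder numbers, the 1Gi limit fails the check and the previous 1.5Gi limit passes. That matches the intuition that restoring the old limit is a reasonable first mitigation while the root cause is investigated.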
