On-call · Hard · oc-g410

Subject: OOM
Level: Mid–Senior
Time: ~30 min
Common in: Reliability & on-call interviews
Industries: Technology, Software development

Question

A Java export service (limit 4Gi) that was rock-solid for months suddenly gets OOMKilled (exit 137) twice this afternoon, each time within seconds of a *single* request — not a slow climb. Memory dashboards show a flat ~1.2GB baseline with sudden vertical spikes to the limit right before each kill. There was no deploy. The access log shows that just before each crash, one new enterprise customer called the 'export all records' endpoint with no date filter, returning ~8 million rows that the service materializes fully into a list and serializes in memory before responding. Triage and mitigate.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
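One way to make the "mitigate, then prevent a repeat" steps concrete is sketched below in Java. It is a minimal illustration under stated assumptions, not the service's actual code: the class `ExportGuard`, both method names, and the row cap are hypothetical. The first method is the kind of quick guard that could stop the bleeding (reject an export with no date filter), and the second shows the longer-term direction (write rows out one at a time instead of materializing all ~8 million into a list).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Iterator;

/** Hypothetical sketch: names and limits are illustrative, not the real service API. */
final class ExportGuard {

    /** Short-term mitigation: refuse unbounded exports before any rows are loaded. */
    static void requireBoundedRequest(String fromDate, String toDate) {
        if (fromDate == null || toDate == null) {
            throw new IllegalArgumentException(
                "export requires a date range; unbounded exports are temporarily disabled");
        }
    }

    /**
     * Longer-term fix: stream rows to the response one at a time, so heap use
     * depends on the size of one row rather than on the size of the result set.
     * Fails once maxRows is reached (output written so far is not rolled back).
     * CSV quoting/escaping is deliberately omitted to keep the sketch short.
     */
    static long streamCsv(Iterator<String[]> rows, Writer out, long maxRows) {
        long written = 0;
        try {
            while (rows.hasNext()) {
                if (written >= maxRows) {
                    throw new IllegalStateException("row cap exceeded: " + maxRows);
                }
                out.write(String.join(",", rows.next()));
                out.write('\n');
                written++;
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return written;
    }
}
```

In a real service the iterator would presumably be backed by a database cursor with a bounded fetch size, and the writer by the HTTP response stream. The guard addresses the symptom for this one customer, while streaming plus a cap addresses the root cause: memory use that scales with result size.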
