On-callHardoc-g190

Subject OomLevel Senior–Staff~40 minCommon in Reliability & on-call interviewsIndustries Software development

Question

On a bare-metal host running a Kafka broker plus several colocated agents, the kernel OOM-killer kills a random colocated agent (not Kafka) at 21:00, taking out log shipping. `dmesg` shows 'Out of memory: Killed process' with a high oom_score for the victim. The Kafka JVM heap is a modest 6GB but the host's overall 'available' memory hit near zero; most RAM is consumed by page cache from Kafka's heavy mmap'd segment I/O, and a batch backfill job started at 20:30 allocating large anonymous buffers. `vm.overcommit_memory` is default. Triage and mitigate.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.