Code Room
On-callHard
Question
On a bare-metal host running a Kafka broker plus several colocated agents, the kernel OOM-killer kills a random colocated agent (not Kafka) at 21:00, taking out log shipping. `dmesg` shows 'Out of memory: Killed process' with a high oom_score for the victim. The Kafka JVM heap is a modest 6GB but the host's overall 'available' memory hit near zero; most RAM is consumed by page cache from Kafka's heavy mmap'd segment I/O, and a batch backfill job started at 20:30 allocating large anonymous buffers. `vm.overcommit_memory` is default. Triage and mitigate.
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.