When the kernel kills your process

Memory climbs toward a hard line. The moment it crosses, the kernel picks a victim and sends one quiet, fatal signal.

The idea

Every container runs under a hard memory limit: a Kubernetes pod limit, a cgroup ceiling, or just the physical RAM of the host. As your process allocates memory, its resident set size (RSS) climbs toward that line. There is no graceful warning when it gets close.
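
Because nothing warns the process, one option is to check the remaining headroom yourself. The sketch below assumes cgroup v2 mounted at /sys/fs/cgroup, where memory.max holds the limit and memory.current holds current usage. The function names are illustrative, and memory.current also counts page cache, so it typically reads higher than RSS alone.

```python
from pathlib import Path
from typing import Optional

# Assumption: cgroup v2 mounted at the usual location inside the container.
CGROUP_DIR = Path("/sys/fs/cgroup")


def parse_limit(raw: str) -> Optional[int]:
    """Parse memory.max contents: a byte count, or 'max' for unlimited."""
    value = raw.strip()
    return None if value == "max" else int(value)


def memory_headroom(cgroup_dir: Path = CGROUP_DIR) -> Optional[int]:
    """Return bytes left before the limit, or None if unknown or unlimited."""
    try:
        limit = parse_limit((cgroup_dir / "memory.max").read_text())
        current = int((cgroup_dir / "memory.current").read_text())
    except (OSError, ValueError):
        return None  # not cgroup v2, or the files are not readable
    return None if limit is None else limit - current
```

A process could poll memory_headroom() and shed load or log a warning when the value gets small, which is the warning the kernel itself never gives.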

When usage would cross the limit and the kernel cannot reclaim enough memory, its OOM killer chooses a victim by oom_score and sends it SIGKILL. The process dies instantly, with no cleanup and no stack trace, and exits with code 137 (that is 128 + 9, where 9 is SIGKILL). If something restarts it, it climbs and dies again; in Kubernetes, that loop shows up as CrashLoopBackOff.
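
The 128 + 9 arithmetic can be checked locally. This is a minimal sketch, assuming a POSIX system; shell_exit_code and kill_child_and_report are hypothetical helper names. Python's subprocess module reports death by signal N as return code -N, while shells and container runtimes report 128 + N.

```python
import signal
import subprocess
import sys


def shell_exit_code(returncode: int) -> int:
    """Map a subprocess return code to the shell-style exit status."""
    # On POSIX, subprocess encodes "killed by signal N" as -N.
    if returncode < 0:
        return 128 + (-returncode)
    return returncode


def kill_child_and_report() -> int:
    # Start a child that would otherwise sleep for a minute.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    child.send_signal(signal.SIGKILL)  # uncatchable, like the OOM killer's signal
    child.wait()
    return shell_exit_code(child.returncode)
```

On Linux, kill_child_and_report() should return 137, the same code a pod reports after an OOM kill.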

How it works

A leak is just memory you allocate and never release. Here a request handler keeps appending to a module-level list, so RSS only grows — it never falls back down between requests.

cache = []  # module-level, never cleared

# load_batch() and summarize() are application helpers, not shown here.
def handle(request):
    # each request appends ~5 MB and keeps the reference,
    # so the garbage collector can never reclaim it.
    cache.append(load_batch(request))   # leak
    return summarize(cache[-1])
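
One possible fix, sketched under assumptions: if the handler really needs recent batches, a bounded collections.deque keeps only the newest few. The cap of 8 is arbitrary, and load_batch and summarize here are hypothetical stand-ins, since the original helpers are not shown.

```python
from collections import deque

MAX_CACHED_BATCHES = 8  # assumed cap; tune it to the memory budget

# A bounded deque drops its oldest entry when a new one arrives,
# so the cache never holds more than MAX_CACHED_BATCHES batches.
cache = deque(maxlen=MAX_CACHED_BATCHES)


def load_batch(request):
    # Hypothetical stand-in: a real handler would load request data here.
    return [request] * 1000


def summarize(batch):
    # Hypothetical stand-in: returns the batch size as a "summary".
    return len(batch)


def handle(request):
    batch = load_batch(request)
    cache.append(batch)  # the oldest batch is evicted once the cap is reached
    return summarize(batch)
```

If nothing ever reads the older batches, dropping the cache entirely is simpler still: memory then stays flat between requests.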

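When a cgroup exceeds its memory limit and reclaim cannot free enough, the kernel scans the candidate tasks, scores each one, and kills the one with the highest score. A higher oom_score means a task that uses more memory or is less protected, and so a more likely victim; oom_score_adj (from -1000 to 1000) shifts the score up or down. You see it after the fact: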

# dmesg / journalctl on the node
Out of memory: Killed process 4127 (python) total-vm:812044kB,
  anon-rss:524288kB ... oom_score_adj:0
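
The oom_score_adj value in that log line can also be read for a live process. On Linux, /proc/<pid>/oom_score and /proc/<pid>/oom_score_adj expose the current values. A minimal sketch, with illustrative function names:

```python
from pathlib import Path
from typing import Optional


def read_proc_int(pid: str, name: str) -> Optional[int]:
    """Read an integer file such as /proc/<pid>/oom_score (Linux only)."""
    try:
        return int(Path("/proc", pid, name).read_text())
    except (OSError, ValueError):
        return None  # not Linux, or the process is gone


def oom_report(pid: str = "self") -> dict:
    # oom_score: the kernel's current badness score for the task.
    # oom_score_adj: an adjustment in [-1000, 1000] that biases that score.
    return {
        "oom_score": read_proc_int(pid, "oom_score"),
        "oom_score_adj": read_proc_int(pid, "oom_score_adj"),
    }
```

Calling oom_report() with no argument inspects the current process; passing another PID as a string inspects a neighbour, subject to permissions.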

# kubectl describe pod api-7f9c
    Last State:   Terminated
      Reason:     OOMKilled
      Exit Code:  137            # 128 + 9 (SIGKILL)
    State:        Waiting
      Reason:     CrashLoopBackOff
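
The same check can be scripted. In the JSON from kubectl get pod -o json, each entry under status.containerStatuses carries a lastState.terminated object with reason and exitCode fields. The sketch below assumes that shape; the sample pod and its container names are made up for illustration.

```python
def oom_killed_containers(pod: dict) -> list:
    """Return names of containers whose last termination was an OOM kill."""
    killed = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            killed.append(status.get("name"))
    return killed


# Trimmed, made-up example shaped like `kubectl get pod <name> -o json`.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "api",
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {
                    "terminated": {"reason": "OOMKilled", "exitCode": 137}
                },
            },
            {"name": "sidecar", "state": {"running": {}}, "lastState": {}},
        ]
    }
}

print(oom_killed_containers(sample_pod))  # prints ['api']
```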

Trade-offs

Remedy	Cost	Time to fix	Durability
Raise the limit	More RAM per pod, fewer pods per node	Minutes	None for a true leak — only delays the crash
Fix the leak	Engineering time, profiling	Hours to days	Permanent — RSS stops growing
Add backpressure	Bounded buffers, slower under load	Hours	Strong — bounds memory regardless of input size
Set requests < limits	Risk of eviction under node pressure	Minutes	Schedules safely, but bursting above request can still OOM
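
The backpressure row can be sketched with a bounded queue: once the buffer is full, new work is refused rather than stored. This is an illustration under assumptions; MAX_PENDING and try_enqueue are hypothetical names, and the bound of 100 is arbitrary.

```python
import queue

MAX_PENDING = 100  # assumed bound; derive it from the memory budget

pending = queue.Queue(maxsize=MAX_PENDING)


def try_enqueue(item) -> bool:
    """Accept work only while the buffer has room; otherwise push back."""
    try:
        pending.put_nowait(item)
        return True
    except queue.Full:
        return False  # caller should reject or retry later
```

The design choice is to fail fast at the edge: a refused request costs the client a retry, while an accepted one that cannot fit costs the whole process.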

Watch out for

Exit 137 is silent: SIGKILL can't be caught, blocked, or ignored, so the process gets no chance to log a stack trace or run a shutdown hook. The only evidence is in the kernel log (dmesg) and the pod's last state.
The OOM killer may kill a sibling, not the culprit. It picks by oom_score, so a well-behaved large process can be the victim of a small leaker next to it.
Native / off-heap memory isn't bounded by the JVM's -Xmx. Direct buffers, thread stacks, and JNI allocations live outside the heap, so you can OOM with plenty of heap headroom.
CrashLoopBackOff masks the root cause — it looks like a crashy app, but the real signal is Reason: OOMKilled in the previous termination, not the restart loop itself.
Bumping the limit only buys time against a true leak. The line moves up; the curve still climbs into it, just later.
An unbounded cache looks exactly like a leak on the graph. Tell them apart: a leak grows forever, a hot cache should plateau once warm.
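
The leak-versus-cache distinction in the last item can be turned into a rough check on a series of RSS samples. This is a heuristic sketch, not a diagnostic: growth_ratio is a hypothetical helper that compares growth in the second half of the window with growth in the first half, and the sample series are synthetic.

```python
def growth_ratio(samples):
    """Compare late growth to early growth in a series of RSS samples.

    Near 0: memory levelled off (cache-like). Near 1 or above: still
    climbing about as fast as at the start (leak-like).
    """
    if len(samples) < 4:
        raise ValueError("need at least 4 samples")
    mid = len(samples) // 2
    early = samples[mid] - samples[0]
    late = samples[-1] - samples[mid]
    if early <= 0:
        return 0.0 if late <= 0 else float("inf")
    return late / early


# Synthetic series in MB: one grows linearly, one plateaus at 300.
leak_like = [100 + 5 * i for i in range(20)]
cache_like = [min(100 + 20 * i, 300) for i in range(20)]
```

The window has to be long enough for a cache to warm up; a short window makes every cache look like a leak.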

Worked example

A service is limited to 512 MB. A request batch loads 700 MB, all of it held in memory at the same time. RSS climbs steadily as the batch loads and crosses 512 MB at roughly t = 8 s. The kernel sends SIGKILL; the process exits 137 with Reason: OOMKilled.

The pod restarts, takes the same batch, and climbs into the same wall about every 30s — that repeating death is the CrashLoopBackOff you see in kubectl get pods. Three fixes, in order of durability: stream the batch (process it in chunks so peak RSS stays low), add a bounded buffer so input size can't dictate memory, or — as a stopgap — raise the limit above the 700 MB working set. Streaming is the real fix; raising the limit just moves the line.
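
A minimal sketch of the streaming fix, assuming the batch can be consumed as an iterable of records. The chunks and process_streaming names are illustrative, and summing integers stands in for the real per-chunk work.

```python
from typing import Iterable, Iterator, List


def chunks(records: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield lists of at most `size` records without materializing the input."""
    chunk: List[int] = []
    for record in records:
        chunk.append(record)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def process_streaming(records: Iterable[int], size: int = 1000) -> int:
    total = 0
    for chunk in chunks(records, size):
        # Stand-in for the real work. Only about two chunks are alive
        # at once: this one and the next one being filled.
        total += sum(chunk)
    return total
```

Peak memory now depends on the chunk size rather than on the 700 MB batch, so the chunk size can be chosen to fit under the 512 MB limit.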

Check yourself

A pod exits with code 137 and its previous state shows Reason: OOMKilled. What does that tell you?

You raise the pod's limit from 512 MB to 1 GB. It runs longer, then OOM-kills again. What's the likely cause?