On-call · Medium · oc-g556

Subject: On call · Level: Mid–Senior · ~40 min · Common in: Reliability & on-call interviews · Industries: Technology

Question

Every few days, one or two pods of your notification service get OOM-killed and restart; lately it's getting more frequent. The memory dashboard shows a slow, steady sawtooth: each pod's heap climbs roughly linearly over ~3 days from 400MB to its 2GB limit, then OOMs and resets. Traffic is flat and there's no daily spike — the climb is independent of load. GC is running but reclaiming less and less over time (the post-GC baseline keeps rising). The last meaningful deploy was 10 days ago, which added an in-process LRU-ish cache for rendered email templates and a per-request metrics tagger. No single request is slow. How do you diagnose and fix a leak like this?
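The numbers in the prompt are enough for a back-of-envelope leak rate, which helps size the problem before reaching for a profiler. The sketch below assumes the 2GB limit means 2048 MB and that growth is exactly linear; the per-pod request rate of 50 requests per second is a purely hypothetical figure, used only to show how a leak rate converts into bytes retained per request when traffic is flat.

```python
# Back-of-envelope leak sizing from the figures in the question.
# Assumptions (not from the source): 2GB == 2048 MB, perfectly linear growth,
# and a hypothetical steady request rate per pod.

def leak_rate_mb_per_hour(start_mb: float, limit_mb: float, days: float) -> float:
    """Average heap growth per hour between the post-restart baseline and the limit."""
    return (limit_mb - start_mb) / (days * 24)


def hours_until_oom(current_mb: float, limit_mb: float, rate_mb_per_hour: float) -> float:
    """Time left before the pod hits its memory limit at the current growth rate."""
    if rate_mb_per_hour <= 0:
        return float("inf")
    return max(0.0, (limit_mb - current_mb) / rate_mb_per_hour)


def bytes_retained_per_request(rate_mb_per_hour: float, requests_per_second: float) -> float:
    """If the leak is per-request, roughly how many bytes each request leaves behind."""
    return rate_mb_per_hour * 1024 * 1024 / (requests_per_second * 3600)


rate = leak_rate_mb_per_hour(start_mb=400, limit_mb=2048, days=3)
remaining = hours_until_oom(current_mb=1500, limit_mb=2048, rate_mb_per_hour=rate)
per_request = bytes_retained_per_request(rate, requests_per_second=50)  # hypothetical rate

print(f"leak rate: {rate:.1f} MB/hour")
print(f"time to OOM from 1500 MB: {remaining:.1f} hours")
print(f"retained per request at 50 rps: {per_request:.0f} bytes")
```

A small per-request figure would point toward something like a retained key or tag rather than whole rendered templates, but that is only a hint; a heap snapshot comparison is what actually confirms it.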

What a strong answer looks like

Mitigate first to stop the bleeding, then form hypotheses from real signals; here those are the rising post-GC baseline and the two changes in the last meaningful deploy. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
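One hypothesis worth testing is that the "LRU-ish" template cache never actually evicts, for example because its key includes per-recipient data. The source does not say this is the cause, so treat the following as a minimal sketch of what a fix could look like if the heap snapshot points there. `BoundedLRUCache`, the key shape, and the `max_entries` value are all hypothetical names chosen for illustration.

```python
from collections import OrderedDict
from typing import Hashable, Optional


class BoundedLRUCache:
    """LRU cache with a hard entry cap, so memory cannot grow without bound."""

    def __init__(self, max_entries: int) -> None:
        if max_entries <= 0:
            raise ValueError("max_entries must be positive")
        self._max_entries = max_entries
        self._data: "OrderedDict[Hashable, str]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[str]:
        if key not in self._data:
            return None
        # Mark the entry as most recently used.
        self._data.move_to_end(key)
        return self._data[key]

    def put(self, key: Hashable, value: str) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        # Evict least recently used entries until we are back under the cap.
        while len(self._data) > self._max_entries:
            self._data.popitem(last=False)

    def __len__(self) -> int:
        return len(self._data)


# Simulate a high-cardinality key (template name plus a per-user id).
# Without the cap this would hold 1000 entries; with it, only the newest 3 remain.
cache = BoundedLRUCache(max_entries=3)
for i in range(1000):
    cache.put(("welcome_email", f"user-{i}"), f"<html>hello {i}</html>")

print(len(cache))
```

A cap by entry count is the simplest guard; if rendered templates vary widely in size, a cap by total bytes may be a better fit. The same reasoning applies to the per-request metrics tagger: if tag values are unbounded, the set of distinct series held in memory grows the same way.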
