Code Room
On-call · Hard · oc-g404
Subject: Memory leak · Level: Senior–Staff · ~35 min · Common in: Reliability & on-call interviews · Industries: Technology, Software development

Question

A Go request-fan-out service's RSS climbs ~120MB/hour and it gets OOMKilled roughly every 20 hours; restarting resets it and the climb begins again. Heap dashboards show in-use heap rising slowly, but the standout signal is that the `go_goroutines` metric rises monotonically — from ~300 at boot to tens of thousands by hour 18 — tracking RSS almost exactly. A pprof goroutine dump shows huge counts blocked on channel receive in one helper that fans out to several backends and waits for the first result. A change last week added a slow optional backend to that fan-out. Triage and mitigate.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
