Question
A new version of the `feed` service rolls out fully at 10:00. No immediate problem. Starting ~10:40 and worsening over the next hour, pods begin getting OOMKilled and entering CrashLoopBackOff in a slow rolling wave; capacity drops, latency climbs, and the autoscaler thrashes. Dashboards: per-pod RSS climbs roughly linearly from pod start until it hits the memory limit, then OOMKill, then the cycle repeats on the fresh pod; GC pause time is flat; request rate is normal. The new version added a per-request in-memory dedup set keyed by item ID, intended to be request-scoped, but it was attached to a module-level singleton. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.