Question
A release of a long-running stream-processing service passes a 30-minute canary cleanly and auto-promotes at 12:00. The canary's heap, latency, and error rate all looked flat. Starting ~3 hours after full promotion, however, pods across the fleet begin OOMKilling in a slow rolling wave. Dashboards show per-pod heap climbing linearly from each pod's start time at ~50MB/hour; over the 30-minute canary that amounted to a barely visible ~25MB rise, well within headroom. The new release added an unbounded in-memory map, keyed by stream session, whose entries are never evicted. Triage, explain why the canary passed, then mitigate.
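The arithmetic behind the scenario can be checked with a small sketch. The slope, canary length, and time to first OOM come from the description above; the headroom figure is only inferred from them (assuming pods restarted at promotion and were killed once the leak consumed their spare memory), and the function name is hypothetical.

```python
# Figures quoted in the scenario.
SLOPE_MB_PER_HOUR = 50.0
CANARY_HOURS = 0.5          # 30-minute canary
HOURS_TO_FIRST_OOM = 3.0    # "~3 hours after full promotion"


def heap_rise_mb(slope_mb_per_hour: float, hours: float) -> float:
    """Heap growth of a linear leak after `hours` of pod uptime."""
    return slope_mb_per_hour * hours


# Growth the canary could actually observe before auto-promotion.
canary_rise = heap_rise_mb(SLOPE_MB_PER_HOUR, CANARY_HOURS)

# Inferred, not measured: headroom implied by pods dying ~3 hours after start.
implied_headroom = heap_rise_mb(SLOPE_MB_PER_HOUR, HOURS_TO_FIRST_OOM)

print(f"canary saw {canary_rise:.0f} MB of roughly {implied_headroom:.0f} MB headroom")
```

Under these assumptions the canary window covers only about one sixth of the time a pod needs to exhaust its headroom, so a check on absolute heap level has nothing to trip on.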
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
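One possible shape for the "prevents a repeat" step is a canary gate that judges the heap trend rather than the heap level. The following is a minimal sketch, not a description of any particular deployment tool: the function names, the 24-hour projection horizon, the 800MB baseline, and the 150MB headroom are all hypothetical values chosen for illustration.

```python
def heap_slope_mb_per_hour(samples):
    """Least-squares slope of (uptime_hours, heap_mb) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a slope")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples must span more than one timestamp")
    return cov / var


def canary_passes(samples, headroom_mb, horizon_hours=24.0):
    """Pass only if the growth projected over the horizon fits in headroom."""
    projected = heap_slope_mb_per_hour(samples) * horizon_hours
    return projected < headroom_mb


# Hypothetical canary data: one sample every 5 minutes for 30 minutes.
leaky = [(i / 12, 800.0 + 50.0 * (i / 12)) for i in range(7)]
flat = [(i / 12, 800.0) for i in range(7)]

print(canary_passes(leaky, headroom_mb=150.0))  # 50 MB/h projected over 24 h
print(canary_passes(flat, headroom_mb=150.0))   # no growth
```

A gate like this trades some false positives (services that warm caches during startup also show an early slope) for the ability to catch slow linear leaks within a short canary window, so in practice the horizon and threshold would need tuning per service.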