On-call · Hard · oc-g476

Subject: Canary failure
Level: Senior–Staff
~40 min
Common in: Reliability & on-call interviews
Industries: Technology, Software development

Question

A release of a long-running stream-processing service passes a 30-minute canary cleanly and auto-promotes at 12:00. The canary's heap, latency, and error rate were all flat. But starting ~3 hours after full promotion, pods across the fleet begin OOMKilling in a slow rolling wave. Dashboards show per-pod heap climbing slowly and linearly from each pod's start time, with a slope of ~50MB/hour. Over a 30-minute canary that rise was a barely visible ~25MB, well within headroom. The new release added an unbounded in-memory map, keyed by stream session, whose entries are never evicted. Triage, explain why the canary passed, then mitigate.
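The arithmetic behind "why the canary passed" can be sketched in a few lines of Python. The slope and canary duration come from the question. The ~150MB headroom is an assumption, inferred from pods dying roughly 3 hours after starting at ~50MB/hour; the function names are hypothetical.

```python
# Numbers from the question.
SLOPE_MB_PER_HOUR = 50.0
CANARY_HOURS = 0.5

# ASSUMPTION: headroom is not stated in the question. ~150MB is inferred
# from OOMKills starting ~3 hours after each pod's start at ~50MB/hour.
ASSUMED_HEADROOM_MB = 150.0


def heap_growth_mb(slope_mb_per_hour: float, hours: float) -> float:
    """Heap added by a linear leak after `hours` of uptime."""
    return slope_mb_per_hour * hours


def hours_until_oom(headroom_mb: float, slope_mb_per_hour: float) -> float:
    """Uptime at which a linear leak exhausts the available headroom."""
    if slope_mb_per_hour <= 0:
        return float("inf")  # flat or shrinking heap never hits the limit
    return headroom_mb / slope_mb_per_hour


canary_rise_mb = heap_growth_mb(SLOPE_MB_PER_HOUR, CANARY_HOURS)
time_to_oom_hours = hours_until_oom(ASSUMED_HEADROOM_MB, SLOPE_MB_PER_HOUR)

print(f"rise during canary: {canary_rise_mb:.0f} MB")
print(f"hours until OOM:    {time_to_oom_hours:.1f}")
```

Under these assumptions the canary window covers only about a sixth of the time needed to reach the limit, so a check on absolute heap value sees nothing alarming.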

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
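One possible "prevent a repeat" item is a canary gate that judges the heap trend rather than its absolute value. The sketch below is a minimal illustration, not a description of any particular canary tool: `fit_slope` and `canary_passes` are hypothetical names, the 24-hour projection horizon and 150MB headroom are assumed values, and the samples are synthetic.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (hours since pod start, heap in MB)


def fit_slope(samples: List[Sample]) -> float:
    """Ordinary least-squares slope of heap (MB) against uptime (hours)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a slope")
    mean_t = sum(t for t, _ in samples) / n
    mean_h = sum(h for _, h in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    cov = sum((t - mean_t) * (h - mean_h) for t, h in samples)
    return cov / var_t


def canary_passes(samples: List[Sample], headroom_mb: float,
                  horizon_hours: float) -> bool:
    """Pass only if the fitted growth, projected over the horizon, fits in headroom."""
    slope = max(fit_slope(samples), 0.0)  # ignore a shrinking heap
    return slope * horizon_hours < headroom_mb


# Synthetic 30-minute canary: one sample every 5 minutes, leaking 50MB/hour.
leaky = [(i / 12, 500.0 + 50.0 * (i / 12)) for i in range(7)]
flat = [(i / 12, 500.0) for i in range(7)]

# ASSUMPTION: 150MB headroom and a 24-hour horizon are illustrative values.
print(canary_passes(leaky, headroom_mb=150.0, horizon_hours=24.0))
print(canary_passes(flat, headroom_mb=150.0, horizon_hours=24.0))
```

A real gate would also need to tolerate warm-up growth and GC noise, for example by discarding the first few minutes of samples or requiring a minimum fit quality. The underlying fix in the service is still to bound or evict the session map.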
