On-call | Medium | oc-g274

Subject: Bad rollout
Level: Mid–Senior
Time: ~30 min
Common in: Reliability & on-call interviews
Industries: Technology, Software development

Question

A new version of the `feed` service rolls out fully at 10:00. No immediate problem. Starting ~10:40 and worsening over the next hour, pods begin getting OOMKilled and entering CrashLoopBackOff in a slow rolling wave; capacity drops, latency climbs, and the autoscaler thrashes. Dashboards: per-pod RSS climbs roughly linearly from pod start until it hits the memory limit, then OOMKill, then the cycle repeats on the fresh pod; GC pause time is flat; request rate is normal. The new version added a per-request in-memory dedup set keyed by item ID, intended to be request-scoped, but it was attached to a module-level singleton. How do you triage and mitigate?

What a strong answer looks like

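Stop the bleeding first: the 10:00 rollout is the obvious recent change, so mitigate by rolling back to the previous version before investigating further. Then form hypotheses from real signals: RSS that climbs roughly linearly from pod start while request rate stays normal points to memory retained per request, not to a traffic spike. Separate root cause (the dedup set attached to a module-level singleton) from symptoms (OOMKills, CrashLoopBackOff, lost capacity, autoscaler thrash). Communicate status as you go, and close with what prevents a repeat.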
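One way to turn the RSS dashboard into a number you can act on is to fit a straight line to per-pod RSS against pod age and extrapolate to the memory limit. The sketch below assumes you can export `(pod_age_minutes, rss_mib)` samples from your metrics system; the function names and all sample values are made up for illustration.

```python
# A rough sketch: least-squares line through RSS samples, extrapolated to the
# memory limit. Sample values in the usage note are invented, not measured.


def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def minutes_until_limit(samples, limit_mib):
    """Estimate minutes from the latest sample until RSS reaches the limit.

    samples: list of (pod_age_minutes, rss_mib), ordered by pod age.
    Returns None when RSS is not growing, i.e. there is no linear leak signal.
    """
    xs, ys = zip(*samples)
    slope, intercept = fit_line(xs, ys)
    if slope <= 0:
        return None
    age_at_limit = (limit_mib - intercept) / slope
    return max(0.0, age_at_limit - xs[-1])
```

For example, with hypothetical samples `[(0, 200), (10, 400), (20, 600), (30, 800)]` and a 1000 MiB limit, the fit gives about 20 MiB per minute and roughly 10 minutes of headroom left. A similar slope across pods regardless of when they started supports the "leak grows from pod start" hypothesis, and the headroom figure tells you how much time a restart buys while the rollback completes.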
