On-callHardoc-g046

Subject Feature flagLevel Senior–Staff~40 minCommon in Reliability & on-call interviewsIndustries Technology, Software development

Question

No code deploy in 3 days. At 09:15 a PM flips a feature flag 'new_recommendations' from 5% to 100% via the flag dashboard. Within 90 seconds the recommendations service (Go) sees CPU pegged at 100% across all pods, p99 jumps to 7s, and the upstream homepage starts returning partial responses. The flag-eval SDK metrics show evaluation latency normal. APM shows the new code path makes 1 synchronous call per render to a personalization service that was previously only hit for 5% of traffic. That service's error rate is now 30% and its own pods are OOMKilling. Triage and mitigate.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.