Question
No code deploy in 3 days. At 09:15 a PM flips a feature flag 'new_recommendations' from 5% to 100% via the flag dashboard. Within 90 seconds the recommendations service (Go) sees CPU pegged at 100% across all pods, p99 jumps to 7s, and the upstream homepage starts returning partial responses. The flag-eval SDK metrics show evaluation latency normal. APM shows the new code path makes 1 synchronous call per render to a personalization service that was previously only hit for 5% of traffic. That service's error rate is now 30% and its own pods are OOMKilling. Triage and mitigate.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.