Question
A new recommendations algorithm is behind a flag. At 15:00 latency starts climbing badly on the homepage; you decide to kill it. You flip the flag OFF in the flag dashboard at 15:04. Latency does NOT recover. You re-check the dashboard — flag is definitely OFF, 100% targeting. Dashboards: the recs-service latency stays high; flag-eval metric shows ~40% of pods STILL evaluating the flag as ON at 15:10, six minutes after the kill. The flag SDK is configured in polling mode with a 5-minute poll interval, and some pods cache flag state and only refresh on poll. There's no streaming/SSE flag delivery. How do you triage and mitigate, and what's the durable fix?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.