On-callMediumoc-g278

Subject Feature flagLevel Mid–Senior~30 minCommon in Reliability & on-call interviewsIndustries Technology, Software development

Question

A new recommendations algorithm is behind a flag. At 15:00 latency starts climbing badly on the homepage; you decide to kill it. You flip the flag OFF in the flag dashboard at 15:04. Latency does NOT recover. You re-check the dashboard — flag is definitely OFF, 100% targeting. Dashboards: the recs-service latency stays high; flag-eval metric shows ~40% of pods STILL evaluating the flag as ON at 15:10, six minutes after the kill. The flag SDK is configured in polling mode with a 5-minute poll interval, and some pods cache flag state and only refresh on poll. There's no streaming/SSE flag delivery. How do you triage and mitigate, and what's the durable fix?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.