On-callHardoc-g266

Subject Feature flagLevel Senior–Staff~35 minCommon in Reliability & on-call interviewsIndustries Technology, Software development

Question

A routine deploy of the web tier goes out at 09:30. No incident for 40 minutes. At 10:11, conversion on checkout quietly drops ~30% — no error spike, no latency change, 200s across the board. The only signal: a custom business metric `checkout_completed` falling, and a rise in `flag_eval{flag="new_address_validator", variation="on"}`. The flag was created two sprints ago, evaluated to OFF for everyone, and was set as code-default OFF. The 09:30 deploy bumped the flag SDK major version; in the new SDK, an unrecognized/unset flag now returns the *code default*, and someone had since changed the code default to ON in a refactor — but the dashboard in the flag service still shows the targeting rule as OFF. How do you triage and what's the fix?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.