Question
A routine deploy of the web tier goes out at 09:30. No incident for 40 minutes. At 10:11, conversion on checkout quietly drops ~30% — no error spike, no latency change, 200s across the board. The only signal: a custom business metric `checkout_completed` falling, and a rise in `flag_eval{flag="new_address_validator", variation="on"}`. The flag was created two sprints ago, evaluated to OFF for everyone, and was set as code-default OFF. The 09:30 deploy bumped the flag SDK major version; in the new SDK, an unrecognized/unset flag now returns the *code default*, and someone had since changed the code default to ON in a refactor — but the dashboard in the flag service still shows the targeting rule as OFF. How do you triage and what's the fix?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.