On-callHardoc-g464

Subject Feature flagLevel Senior–Staff~40 minCommon in Reliability & on-call interviewsIndustries Technology, Software development

Question

A deploy bumps the feature-flag SDK from v6 to v7 (a 'modernization' chore, no flag changes intended). For 50 minutes everything is normal. Then a slow drift: a paid 'priority_delivery' experiment that was supposed to be 8% on starts behaving as if ~50% of users are in the treatment, driving up courier-assignment cost. Dashboards: `flag_eval{flag="priority_delivery", variation="on"}` rose from 8% toward 50% over the first hour after deploy, tracking the rolling-deploy progress (pod by pod). No error rate change, no latency change. The flag's percentage rollout is unchanged in the dashboard. v7's changelog notes 'improved hashing for more uniform bucketing.' Triage and mitigate.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.