Ship the new model to a slice of traffic, watch the live metric per variant, and pull it back the moment the candidate quietly drifts below the control.
You have two model variants serving live: control A (the model in production today) and candidate B (the one you hope is better). Each handles a slice of real traffic, split by a stable hash of the user id so a given user always sees the same variant.
B passed every offline check: higher offline accuracy, a better ranking score. But offline metrics are a proxy. What matters is the live business metric, such as click-through rate or conversion, with latency and error rate watched as guardrails. So you measure that metric per variant, compare the two variants with a significance test, and if B is meaningfully worse you cut its traffic to zero. That last move is the rollback.
Routing is deterministic: hash the user id into a bucket so the same user keeps the same variant across requests. Each served request logs the variant and whether the user clicked. On a schedule, you pool the per-variant samples, run a two-sample comparison, and act on the result.
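The stable bucketing described above can be sketched as follows. This is a minimal illustration, not the only way to do it: the `salt` value and the choice of SHA-256 are assumptions, and `hash_to_unit` is written here as one plausible implementation of the helper the routing code relies on. Python's built-in `hash()` is avoided because string hashes are randomized per process by default, which would break the "same user, same variant" guarantee across restarts.

```python
import hashlib


def hash_to_unit(user_id: str, salt: str = "model-ab-test") -> float:
    """Map a user id to a stable float in [0, 1).

    The salt (a hypothetical experiment name) keeps bucket assignments
    independent between experiments that share the same user ids.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Keep 53 bits so the value is exactly representable and strictly below 1.
    top_53_bits = int.from_bytes(digest[:8], "big") >> 11
    return top_53_bits / 2**53


def route(user_id: str, weight_a: float = 0.5) -> str:
    """Send the user to A if their bucket falls inside A's share of traffic."""
    return "A" if hash_to_unit(user_id) < weight_a else "B"


if __name__ == "__main__":
    for uid in ("user-1", "user-2", "user-3"):
        print(uid, route(uid), route(uid))  # both calls always agree
```

Because the bucket depends only on the salt and the user id, setting `weight_a` to 1.0 sends every user to A without reshuffling anyone who was already there.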
The decision is one-sided and guarded: roll back only when the candidate is worse with confidence, not on the first noisy dip. A guardrail check (latency, error rate) can also force a rollback on its own.
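A guardrail check of the kind mentioned above might look like the sketch below. The thresholds (300 ms for p95 latency, 1% error rate) and the names `GuardrailLimits`, `p95`, and `guardrail_breached` are hypothetical; real limits would come from the service's own objectives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardrailLimits:
    max_latency_p95_ms: float = 300.0  # hypothetical threshold
    max_error_rate: float = 0.01       # hypothetical threshold


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def guardrail_breached(latencies_ms, errors, requests, limits=GuardrailLimits()):
    """Return True if the candidate path violates any guardrail."""
    if requests == 0:
        return False  # nothing served yet, nothing to judge
    too_slow = bool(latencies_ms) and p95(latencies_ms) > limits.max_latency_p95_ms
    too_many_errors = errors / requests > limits.max_error_rate
    return too_slow or too_many_errors


if __name__ == "__main__":
    healthy = guardrail_breached([120.0] * 100, errors=0, requests=100)
    slow_tail = guardrail_breached([120.0] * 90 + [900.0] * 10, errors=0, requests=100)
    print(healthy, slow_tail)
```

Unlike the headline metric, this check needs no significance test in the sketch: any breach is treated as reason enough to roll back.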
from statistics import mean
WEIGHTS = {"A": 0.5, "B": 0.5}  # traffic split
samples = {"A": [], "B": []}    # 1 if clicked, else 0
MIN_N = 1000                    # placeholder: minimum samples per arm before testing
# hash_to_unit, two_sample_test, and alert are helpers assumed to be defined elsewhere
def route(user_id):
    # stable bucketing: same user -> same variant
    b = hash_to_unit(user_id)  # in [0, 1)
    return "A" if b < WEIGHTS["A"] else "B"
def on_impression(variant, clicked):
    # record one served request for the variant that handled it
    samples[variant].append(1 if clicked else 0)
def evaluate():  # run periodically
    a, b = samples["A"], samples["B"]
    if len(a) < MIN_N or len(b) < MIN_N:
        return  # not enough data yet
    delta = mean(b) - mean(a)   # candidate minus control
    p = two_sample_test(b, a)   # one-sided p-value: is B worse than A?
    if delta < 0 and p < 0.05:  # significantly worse
        WEIGHTS["A"], WEIGHTS["B"] = 1.0, 0.0  # roll back to A
        alert("variant B regressed live CTR: rolled back to A")
The key line is delta < 0 and p < 0.05: a drop large enough that random noise is an unlikely explanation. Until both hold, you keep collecting.
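The text does not say which test `two_sample_test` runs. For a click/no-click metric, one common choice is a one-sided, pooled two-proportion z-test, sketched below under that assumption. It returns the probability of seeing a gap at least this unfavourable to B if the two variants truly had the same click rate.

```python
from math import erf, sqrt


def two_sample_test(b, a):
    """One-sided p-value for the alternative "B's click rate is below A's".

    `a` and `b` are non-empty lists of 0/1 outcomes. This is an assumed
    implementation (pooled two-proportion z-test), not the source's own.
    """
    n_a, n_b = len(a), len(b)
    rate_a, rate_b = sum(a) / n_a, sum(b) / n_b
    pooled = (sum(a) + sum(b)) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation in either arm: no evidence B is worse
    z = (rate_b - rate_a) / se
    # Standard normal CDF at z: close to 0 when B is far below A.
    return 0.5 * (1 + erf(z / sqrt(2)))


if __name__ == "__main__":
    control = [1] * 400 + [0] * 9600    # 4.0% click rate
    candidate = [1] * 340 + [0] * 9660  # 3.4% click rate
    print(two_sample_test(candidate, control) < 0.05)
```

Because the test is one-sided, a candidate that beats the control yields a p-value near 1, so it can never trigger the rollback branch.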
| Signal | What it means | Action |
|---|---|---|
| Per-variant metric delta | B’s live metric minus A’s; negative means the candidate is below control | Watch the sign and size, not single-hour wiggles |
| Statistical significance | A p-value below the chosen threshold, or a confidence interval for the delta that excludes zero | Roll back only when the drop is unlikely to be noise |
| Guardrail metrics | Latency p95, error rate, timeouts on the candidate path | Any breach forces rollback regardless of the headline metric |
| Sample size per variant | How many sessions each variant has accrued | Hold the decision until both arms clear a minimum n |
| Segment breakdown | Metric split by device, region, or new vs returning user | Check that an aggregate win isn’t hiding a segment loss |
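
The segment breakdown in the last row can be sketched as a per-segment delta. The event shape `(segment, variant, clicked)` and the function name `segment_deltas` are assumptions made for illustration; the source does not specify how impressions are logged by segment.

```python
from collections import defaultdict


def segment_deltas(events):
    """Return {segment: B's click rate minus A's} from logged impressions.

    `events` is an iterable of (segment, variant, clicked) tuples.
    Segments missing either arm are skipped.
    """
    # counts[segment][variant] = [clicks, impressions]
    counts = defaultdict(lambda: {"A": [0, 0], "B": [0, 0]})
    for segment, variant, clicked in events:
        counts[segment][variant][0] += 1 if clicked else 0
        counts[segment][variant][1] += 1
    deltas = {}
    for segment, arms in counts.items():
        (clicks_a, n_a), (clicks_b, n_b) = arms["A"], arms["B"]
        if n_a and n_b:
            deltas[segment] = clicks_b / n_b - clicks_a / n_a
    return deltas


def make_events(segment, variant, clicks, n):
    """Build n impressions for one segment and arm, `clicks` of them clicked."""
    return [(segment, variant, i < clicks) for i in range(n)]


if __name__ == "__main__":
    events = (
        make_events("desktop", "A", 4, 100) + make_events("desktop", "B", 7, 100)
        + make_events("mobile", "A", 4, 100) + make_events("mobile", "B", 2, 100)
    )
    for segment, delta in sorted(segment_deltas(events).items()):
        print(segment, f"{delta:+.3f}")
```

In this made-up data B wins overall (9 clicks against 8) while losing on mobile, which is the pattern the table warns about. Per-segment samples are smaller than the aggregate, so each segment delta deserves its own significance check before anyone acts on it.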
A recommender team ships candidate B at a 50/50 split. Offline, B raised NDCG by 3 points, so it looked like a clear win. Live, the picture diverged: control A held click-through rate near 4.0%, while B slid hour by hour toward 3.4%.
After roughly 80k sessions per arm, the delta of -0.6 percentage points cleared the 95% confidence bar: the interval for the difference no longer included zero. The job set B’s weight to 0, returning all traffic to A, and paged the owners. Post-mortem: B over-indexed on long-tail items that scored well offline but users rarely clicked. The offline metric had rewarded exactly the behaviour the live metric punished.
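To make the "interval no longer included zero" step concrete, here is a rough check of the numbers in this example. It assumes pooled click rates of exactly 4.0% and 3.4% over 80,000 sessions per arm (the story gives only approximate figures) and uses an unpooled normal-approximation interval; `diff_confidence_interval` is a name chosen for this sketch.

```python
from math import sqrt


def diff_confidence_interval(clicks_a, n_a, clicks_b, n_b, z=1.96):
    """Normal-approximation CI for (B's rate minus A's rate).

    z=1.96 corresponds to a two-sided 95% interval.
    """
    rate_a, rate_b = clicks_a / n_a, clicks_b / n_b
    se = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    delta = rate_b - rate_a
    return delta - z * se, delta + z * se


if __name__ == "__main__":
    # Assumed counts: 4.0% and 3.4% of 80,000 sessions.
    low, high = diff_confidence_interval(3200, 80_000, 2720, 80_000)
    print(f"95% CI for the delta: [{low:.5f}, {high:.5f}]")
    print("excludes zero:", high < 0)
```

Under these assumed counts the interval runs from roughly -0.78 to -0.42 percentage points, entirely below zero, which is consistent with the rollback firing.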
B’s live CTR dips below A’s for a single hour, then they cross again. Should you roll back?
B’s offline ranking score beats A’s, but its live click-through rate is significantly lower. Which metric decides the rollback?