A/B variant regression

Ship the new model to a slice of traffic, watch the live metric per variant, and pull it back the moment the candidate quietly drifts below the control.

The idea

You have two model variants serving live: control A (the model in production today) and candidate B (the one you hope is better). Each handles a slice of real traffic, split by a stable hash of the user id so a given user always sees the same variant.

B passed every offline check — higher offline accuracy, a better ranking score. But offline metrics are a proxy. What matters is the live business metric: click-through rate, conversion, or latency. So you measure that metric per variant, compare them with a confidence test, and if B is meaningfully worse you cut its traffic to zero. That last move is the rollback.

Watch a regression unfold

4.4% 3.8% 3.2% rollback line A · control B · candidate hours since rollout →
Both variants take 50% of traffic. Press play to watch the live click-through rate roll in, hour by hour.
hour 0 A — B — Δ — A 50% · B 50%

How it works

Routing is deterministic: hash the user id into a bucket so the same user keeps the same variant across requests. Each served request logs the variant and whether the user clicked. On a schedule, you pool the per-variant samples, run a two-sample comparison, and act on the result.

The decision is one-sided and guarded: roll back only when the candidate is worse with confidence, not on the first noisy dip. A guardrail check (latency, error rate) can also force a rollback on its own.

WEIGHTS = {"A": 0.5, "B": 0.5}      # traffic split
samples = {"A": [], "B": []}        # 1 if clicked, else 0

def route(user_id):
    # stable bucketing: same user -> same variant
    b = hash_to_unit(user_id)        # in [0, 1)
    return "A" if b < WEIGHTS["A"] else "B"

def on_impression(variant, clicked):
    samples[variant].append(1 if clicked else 0)

def evaluate():                      # run periodically
    a, b = samples["A"], samples["B"]
    if len(a) < MIN_N or len(b) < MIN_N:
        return                       # not enough data yet
    delta = mean(b) - mean(a)        # candidate minus control
    p = two_sample_test(b, a)        # is B worse, with confidence?
    if delta < 0 and p < 0.05:        # significantly worse
        WEIGHTS["A"], WEIGHTS["B"] = 1.0, 0.0   # roll back to A
        alert("variant B regressed live CTR — rolled back")

The key line is delta < 0 and p < 0.05: a drop large enough that random noise is an unlikely explanation. Until both hold, you keep collecting.

Signals

SignalWhat it meansAction
Per-variant metric deltaB’s live metric minus A’s; negative means the candidate is below controlWatch the sign and size, not single-hour wiggles
Statistical significancep-value or a confidence interval that excludes zero differenceRoll back only when the drop is unlikely to be noise
Guardrail metricsLatency p95, error rate, timeouts on the candidate pathAny breach forces rollback regardless of the headline metric
Sample size per variantHow many sessions each variant has accruedHold the decision until both arms clear a minimum n
Segment breakdownMetric split by device, region, or new vs returning userCheck that an aggregate win isn’t hiding a segment loss

Watch out for

Worked example

A recommender team ships candidate B at a 50/50 split. Offline, B raised NDCG by 3 points, so it looked like a clear win. Live, the picture diverged: control A held click-through rate near 4.0%, while B slid hour by hour toward 3.4%.

After roughly 80k sessions per arm, the delta of -0.6 points cleared the 95% confidence bar — the interval no longer included zero. The job set B’s weight to 0, returning all traffic to A, and paged the owners. Post-mortem: B over-indexed on long-tail items that scored well offline but users rarely clicked. The offline metric had rewarded exactly the behaviour the live metric punished.

Check yourself

B’s live CTR dips below A’s for a single hour, then they cross again. Should you roll back?

B’s offline ranking score beats A’s, but its live click-through rate is significantly lower. Which metric decides the rollback?