A/B variant regression

Ship the new model to a slice of traffic, watch the live metric per variant, and pull it back once the candidate has drifted below the control with confidence.

The idea

You have two model variants serving live: control A (the model in production today) and candidate B (the one you hope is better). Each handles a slice of real traffic, split by a stable hash of the user id so a given user always sees the same variant.

B passed every offline check — higher offline accuracy, a better ranking score. But offline metrics are a proxy. What matters is the live business metric: click-through rate, conversion, or latency. So you measure that metric per variant, compare them with a confidence test, and if B is meaningfully worse you cut its traffic to zero. That last move is the rollback.

Watch a regression unfold

Both variants take 50% of traffic, and the live click-through rate for each is tracked hour by hour alongside the delta between them.

How it works

Routing is deterministic: hash the user id into a bucket so the same user keeps the same variant across requests. Each served request logs the variant and whether the user clicked. On a schedule, you pool the per-variant samples, run a two-sample comparison, and act on the result.

The decision is one-sided and guarded: roll back only when the candidate is worse with confidence, not on the first noisy dip. A guardrail check (latency, error rate) can also force a rollback on its own.

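import hashlib
from math import sqrt
from statistics import NormalDist, mean

MIN_N = 1000                        # assumed minimum impressions per arm
WEIGHTS = {"A": 0.5, "B": 0.5}      # traffic split
samples = {"A": [], "B": []}        # 1 if clicked, else 0

def hash_to_unit(user_id):
    # stable hash of the user id, mapped into [0, 1); SHA-256 is one possible choice
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def route(user_id):
    # stable bucketing: same user -> same variant
    b = hash_to_unit(user_id)        # in [0, 1)
    return "A" if b < WEIGHTS["A"] else "B"

def on_impression(variant, clicked):
    samples[variant].append(1 if clicked else 0)

def two_sample_test(b, a):
    # one-sided two-proportion z-test: p-value for "B's rate is below A's"
    pooled = (sum(a) + sum(b)) / (len(a) + len(b))
    se = sqrt(pooled * (1 - pooled) * (1 / len(a) + 1 / len(b)))
    return NormalDist().cdf((mean(b) - mean(a)) / se) if se > 0 else 1.0

def alert(message):
    print(message)                   # placeholder: swap in your paging hook

def evaluate():                      # run periodically
    a, b = samples["A"], samples["B"]
    if len(a) < MIN_N or len(b) < MIN_N:
        return                       # not enough data yet
    delta = mean(b) - mean(a)        # candidate minus control
    p = two_sample_test(b, a)        # is B worse, with confidence?
    if delta < 0 and p < 0.05:        # significantly worse
        WEIGHTS["A"], WEIGHTS["B"] = 1.0, 0.0   # roll back to A
        alert("variant B regressed live CTR: rolled back")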

The key line is delta < 0 and p < 0.05: a drop large enough that random noise is an unlikely explanation. Until both hold, you keep collecting.
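How large MIN_N should be depends on the smallest drop you want to catch. The sketch below uses the standard normal-approximation sample-size formula for comparing two proportions. The function name min_sessions_per_arm, the 4.0% baseline, the 3.4% target, and the 80% power setting are all assumptions chosen for illustration, not values from a real system.

```python
from math import ceil
from statistics import NormalDist

def min_sessions_per_arm(p_control, p_candidate, alpha=0.05, power=0.8):
    """Approximate sessions per arm for a one-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # one-sided significance threshold
    z_power = NormalDist().inv_cdf(power)       # desired chance of catching the drop
    variance = p_control * (1 - p_control) + p_candidate * (1 - p_candidate)
    effect = p_control - p_candidate            # size of the drop to detect
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)

# Hypothetical inputs: 4.0% baseline CTR, and a drop to 3.4% should be detectable.
n = min_sessions_per_arm(0.040, 0.034)
print(n)
```

With these inputs the estimate comes out at roughly twelve thousand sessions per arm. Halving the detectable drop roughly quadruples the requirement, which is why small regressions take much longer to confirm.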

Signals

Signal	What it means	Action
Per-variant metric delta	B’s live metric minus A’s; negative means the candidate is below control	Watch the sign and size, not single-hour wiggles
Statistical significance	p-value or a confidence interval that excludes zero difference	Roll back only when the drop is unlikely to be noise
Guardrail metrics	Latency p95, error rate, timeouts on the candidate path	Any breach forces rollback regardless of the headline metric
Sample size per variant	How many sessions each variant has accrued	Hold the decision until both arms clear a minimum n
Segment breakdown	Metric split by device, region, or new vs returning user	Check that an aggregate win isn’t hiding a segment loss
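The segment breakdown row is easiest to appreciate with numbers. The sketch below uses made-up click and session counts (hypothetical, not real data) in which B is ahead inside each segment but behind overall, because B's traffic skews toward the low-CTR segment. The names counts, segment_deltas, and overall_delta are illustrative.

```python
# Hypothetical (clicks, sessions) per segment and arm; not real data.
counts = {
    "mobile":  {"A": (40, 2000),  "B": (105, 5000)},
    "desktop": {"A": (300, 5000), "B": (122, 2000)},
}

def ctr(clicks, sessions):
    return clicks / sessions

def segment_deltas(counts):
    """B's CTR minus A's CTR within each segment."""
    return {
        seg: ctr(*arms["B"]) - ctr(*arms["A"])
        for seg, arms in counts.items()
    }

def overall_delta(counts):
    """B's CTR minus A's CTR with all segments pooled."""
    totals = {}
    for arm in ("A", "B"):
        clicks = sum(arms[arm][0] for arms in counts.values())
        sessions = sum(arms[arm][1] for arms in counts.values())
        totals[arm] = ctr(clicks, sessions)
    return totals["B"] - totals["A"]

for seg, d in segment_deltas(counts).items():
    print(f"{seg}: {d:+.4f}")
print(f"overall: {overall_delta(counts):+.4f}")
```

Here B leads by about 0.1 percentage points in both segments yet trails by about 1.6 points once the segments are pooled. Reading only the aggregate, or only the segments, would give opposite answers.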

Watch out for

Peeking: checking significance every hour and acting the first time it crosses the threshold inflates false positives. Use a fixed horizon or a sequential test built for repeated looks.
Too little data: a few hundred sessions per arm produce wild swings. Set a minimum sample size before any rollback fires.
Simpson’s paradox: B can win in every segment yet lose overall (or the reverse) when the traffic mix differs between arms. Read the segment breakdown.
Measuring the offline proxy: a higher offline ranking score is not the live business metric. Decide on CTR or conversion, not the proxy B was tuned for.
Novelty effect: a new model can spike or sag for the first hours simply because it’s new. Let the curve settle before you trust the delta.
Forgetting guardrails: a candidate that lifts CTR but doubles p95 latency or error rate is still a regression. Wire latency and errors as hard rollback triggers.
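A guardrail trigger can be sketched separately from the headline-metric test. The thresholds, the nearest-rank percentile, and the function names below are assumptions for illustration; real limits depend on the service.

```python
from math import ceil

# Assumed limits for illustration only.
MAX_P95_LATENCY_MS = 300.0
MAX_ERROR_RATE = 0.01

def p95(values):
    """95th percentile by the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, ceil(len(ordered) * 95 / 100))   # 1-based rank
    return ordered[rank - 1]

def guardrail_breaches(latencies_ms, errors, requests):
    """Return the names of any guardrails the candidate path has breached."""
    breaches = []
    if latencies_ms and p95(latencies_ms) > MAX_P95_LATENCY_MS:
        breaches.append("latency_p95")
    if requests > 0 and errors / requests > MAX_ERROR_RATE:
        breaches.append("error_rate")
    return breaches

# Hypothetical candidate-path measurements: 10% of requests are slow.
latencies = [120.0] * 90 + [450.0] * 10
breaches = guardrail_breaches(latencies, errors=3, requests=1000)
if breaches:
    print("roll back B, guardrail breached:", breaches)
```

The check never looks at CTR, so a breach forces the rollback even when the headline metric favours the candidate.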

Worked example

A recommender team ships candidate B at a 50/50 split. Offline, B raised NDCG by 3 points, so it looked like a clear win. Live, the picture diverged: control A held click-through rate near 4.0%, while B slid hour by hour toward 3.4%.

After roughly 80k sessions per arm, the delta of -0.6 percentage points cleared the 95% confidence bar: the interval no longer included zero. The job set B’s weight to 0, returning all traffic to A, and paged the owners. Post-mortem: B over-indexed on long-tail items that scored well offline but users rarely clicked. The offline metric had rewarded exactly the behaviour the live metric punished.
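The interval check can be reproduced from the figures in this example. The sketch below uses a normal-approximation (Wald) interval for the difference of two proportions, and it treats 80,000 sessions per arm and CTRs of exactly 4.0% and 3.4% as given, which rounds off the "roughly" and "near" in the story. The function name diff_confidence_interval is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def diff_confidence_interval(p_a, n_a, p_b, n_b, level=0.95):
    """Two-sided normal-approximation interval for (B's rate - A's rate)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    delta = p_b - p_a
    return delta - z * se, delta + z * se

low, high = diff_confidence_interval(0.040, 80_000, 0.034, 80_000)
print(f"95% CI for delta: [{low:.4f}, {high:.4f}]")
```

Under these assumptions the interval runs from roughly -0.8 to -0.4 percentage points. Both ends are negative, so zero is excluded and the rollback condition holds.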

Check yourself

B’s live CTR dips below A’s for a single hour, then they cross again. Should you roll back?

B’s offline ranking score beats A’s, but its live click-through rate is significantly lower. Which metric decides the rollback?