Canary deploy & the bad version

Send the new version a thin slice of traffic first, watch it like a hawk, and let the watchdog pull it back the moment it misbehaves.

The idea

When you ship a new model version, you don't flip everyone over at once. You stand v2 up beside the stable v1 and route just a small slice of requests to it — a canary. Most traffic keeps flowing to the version you already trust.

A watchdog measures v2's error rate on its slice. If v2 has a defect, that rate climbs. The moment it crosses your SLO threshold, the system drains v2 to zero and restores 100% of traffic to v1 — automatically, before the regression reaches everyone.

The whole point: keep the blast radius tiny while you find out whether the new version is actually good.

Watch the canary breach & roll back

[Interactive demo: a load balancer sends 95% of traffic to v1 (stable) and 5% to v2 (canary) while the watchdog samples v2's error rate against the SLO. Starting state (step 0 of 8): v2 error rate 0.3%, SLO 1.5%, healthy.]

How it works

The router rolls a random number per request and sends the request to v2 only when the roll lands inside the canary weight. The watchdog counts v2's errors over a sliding window. Once there are enough samples to trust the number, it compares v2's error rate to the SLO. A breach sets the canary weight to 0. That single change drains v2, and the watchdog then triggers the rollback.

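import random

# serve(), window, and promote_rollback() are assumed to be defined elsewhere
# in the serving system; this sketch does not define them.

canary_weight = 0.05         # 5% to v2, rest to v1
slo           = 0.015        # 1.5% error-rate ceiling for v2
min_samples   = 200          # don't judge on too few requests

def route(req):
    # One roll per request; it lands inside the canary weight about 5% of the time.
    if random.random() < canary_weight:
        version = "v2"
    else:
        version = "v1"
    resp = serve(version, req)
    if version == "v2":
        window.record(ok=resp.ok)    # only v2 outcomes feed the watchdog
    return resp

def watchdog_tick():
    global canary_weight
    if canary_weight == 0:
        return                       # already rolled back; don't fire again
    if window.count < min_samples:
        return                       # not enough signal yet
    v2_err = window.error_rate()
    if v2_err > slo:
        canary_weight = 0            # drain v2 -> 100% back on v1
        promote_rollback("v2 breached SLO")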
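The code above relies on a window object with record(), count, and error_rate(), but never shows it. One way it might look is sketched below, assuming a count-based window over the most recent v2 requests. The class name SlidingWindow and the size of 1000 are illustrative choices, not part of the original; a time-based window would work just as well.

```python
from collections import deque


class SlidingWindow:
    """Keeps the outcomes of the most recent `size` canary requests (hypothetical helper)."""

    def __init__(self, size=1000):
        # A deque with maxlen silently drops the oldest outcome once it is full.
        self._outcomes = deque(maxlen=size)

    def record(self, ok):
        # Store True for a successful response, False for an error.
        self._outcomes.append(bool(ok))

    @property
    def count(self):
        # Number of outcomes currently held, never more than `size`.
        return len(self._outcomes)

    def error_rate(self):
        # Fraction of held outcomes that were errors; 0.0 for an empty window.
        if not self._outcomes:
            return 0.0
        failures = sum(1 for ok in self._outcomes if not ok)
        return failures / len(self._outcomes)


window = SlidingWindow(size=1000)
for i in range(400):
    window.record(ok=(i % 50 != 0))   # every 50th request fails: 8 of 400
print(window.count, window.error_rate())   # prints: 400 0.02
```

In this run the window holds 400 samples, which clears min_samples = 200, and the 2% error rate is above the 1.5% SLO, so a watchdog tick at this point would roll v2 back.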

Signals to watch

Signal | Meaning | Action
v2 error rate vs v1 | Is the canary worse than the baseline it replaces? | Compare, don't judge v2 in isolation
Latency p95 delta | v2 may answer but answer slowly | Roll back on latency regression too
Sample count | How much traffic v2 has actually seen | Wait for min_samples before deciding
SLO threshold | The error-rate ceiling that defines "breached" | Set it from the v1 baseline, not a guess
Rollback mode | Did a human or the watchdog act? | Prefer automatic; humans sleep
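The signals above can be combined into a single verdict. The sketch below is one possible shape, not a prescribed implementation: the function name canary_verdict, the dict layout, and the thresholds (a 2x error ratio, a 0.1% error floor, a 50 ms p95 budget) are all assumptions chosen for illustration. It waits for enough samples, compares v2's error rate to v1's live baseline, and then checks the p95 latency delta.

```python
from statistics import quantiles


def p95(latencies_ms):
    # quantiles(..., n=20) returns 19 cut points; the last is the 95th percentile.
    return quantiles(latencies_ms, n=20)[-1]


def canary_verdict(v1, v2, min_samples=200, max_err_ratio=2.0,
                   err_floor=0.001, max_p95_delta_ms=50.0):
    """v1 and v2 are dicts with 'errors', 'requests', and 'latencies_ms'.

    All thresholds are illustrative assumptions, not recommended values.
    """
    if v2["requests"] < min_samples:
        return "wait"                      # sample count too low to judge

    v1_err = v1["errors"] / v1["requests"]
    v2_err = v2["errors"] / v2["requests"]
    # The floor keeps a near-zero baseline from making any single v2 error fatal.
    if v2_err > max_err_ratio * max(v1_err, err_floor):
        return "rollback: error rate"

    if p95(v2["latencies_ms"]) - p95(v1["latencies_ms"]) > max_p95_delta_ms:
        return "rollback: latency"

    return "healthy"


v1 = {"errors": 30, "requests": 10_000, "latencies_ms": [100.0] * 100}
v2 = {"errors": 9, "requests": 500, "latencies_ms": [100.0] * 100}
print(canary_verdict(v1, v2))   # v2 at 1.8% vs v1 at 0.3% -> "rollback: error rate"
```

Comparing against the live baseline means an outage that hurts both versions equally (a failing dependency, say) does not get blamed on the canary.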

Watch out for

Worked example

A team shipped v2 with a tokenizer change. On its 5% canary slice, the bad-prediction rate climbed from 0.3% to 2.1% within four minutes. The watchdog tripped at the 1.5% SLO, drained v2 to zero, and restored 100% of traffic to v1.

Because only the canary slice ever touched v2, the blast radius stayed at 5% of traffic for four minutes — not a full-fleet incident. The fix shipped the next day behind a fresh canary.
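To put rough numbers on that blast radius, the arithmetic can be sketched as below. The canary weight, duration, and error rates come from the example; the total request rate of 1,000 requests per second is purely an assumption, since the example does not state one. Because v2's rate climbed to 2.1% rather than sitting there the whole time, the result is an upper bound.

```python
canary_weight = 0.05      # from the example: 5% canary slice
exposure_s = 4 * 60       # from the example: four minutes
peak_err = 0.021          # from the example: v2 peaked at 2.1%
baseline_err = 0.003      # from the example: 0.3% before the climb
total_rps = 1000          # ASSUMPTION: not stated in the example

# Requests that actually reached v2 during the incident.
canary_requests = total_rps * canary_weight * exposure_s

# Upper bound on extra bad predictions, treating the peak rate as constant.
extra_bad_canary = canary_requests * (peak_err - baseline_err)

# The same bound if v2 had been serving all traffic.
extra_bad_full_fleet = total_rps * exposure_s * (peak_err - baseline_err)

print(round(canary_requests), round(extra_bad_canary), round(extra_bad_full_fleet))
```

Under that assumed rate, about 12,000 requests touched v2 and at most roughly 216 extra bad predictions were served, against roughly 4,320 for a full-fleet rollout over the same four minutes.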

Check yourself

v2's error rate just crossed the SLO threshold on its canary slice. What should the system do?

Why compare v2's error rate to v1's live baseline rather than to a fixed number?