Canary deploy & the bad version

Send the new version a thin slice of traffic first, watch it like a hawk, and let the watchdog pull it back the moment it misbehaves.

The idea

When you ship a new model version, you don't flip everyone over at once. You stand v2 up beside the stable v1 and route just a small slice of requests to it — a canary. Most traffic keeps flowing to the version you already trust.

A watchdog measures v2's error rate on its slice. If v2 has a defect, that rate climbs. The moment it crosses your SLO threshold, the system drains v2 to zero and restores 100% of traffic to v1 — automatically, before the regression reaches everyone.

The whole point: keep the blast radius tiny while you find out whether the new version is actually good.

Watch the canary breach & roll back

v2 is taking 5% of traffic. The watchdog is sampling its error rate.

Starting state (step 0 of 8): v1 95% / v2 5%, v2 error rate 0.3%, SLO 1.5%, status healthy.

How it works

The router draws a uniform random number in [0, 1) per request and sends the request to v2 only when the draw falls below the canary weight. The watchdog counts v2's errors over a sliding window. Once there are enough samples to trust the number, it compares v2's error rate to the SLO. A breach sets the canary weight to 0. That single change drains v2, and the watchdog then signals the rollback via `promote_rollback`.

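from random import random    # uniform float in [0, 1)

canary_weight = 0.05         # 5% to v2, rest to v1
slo           = 0.015        # 1.5% error-rate ceiling for v2
min_samples   = 200          # don't judge on too few requests

# `serve`, `window`, and `promote_rollback` are assumed to be defined elsewhere.

def route(req):
    # canary_weight is only read here, so no `global` declaration is needed
    version = "v2" if random() < canary_weight else "v1"
    resp = serve(version, req)
    if version == "v2":
        window.record(ok=resp.ok)    # only v2 outcomes feed the watchdog
    return resp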

def watchdog_tick():
    global canary_weight
    if canary_weight == 0:
        return                       # already rolled back; don't fire twice
    if window.count < min_samples:
        return                       # not enough signal yet
    v2_err = window.error_rate()
    if v2_err > slo:
        canary_weight = 0            # drain v2 -> 100% back on v1
        promote_rollback("v2 breached SLO")
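
The snippet above leans on a `window` object it never defines. Here is one way it might look, as a minimal sketch: a count-based window over the most recent v2 requests, backed by `collections.deque`. The class name `SlidingWindow` and its `size` parameter are assumptions made for illustration; a real watchdog may well use a time-based window instead.

```python
from collections import deque


class SlidingWindow:
    """Tracks ok/error outcomes for the most recent `size` requests."""

    def __init__(self, size=1000):
        # A deque with maxlen drops the oldest outcome automatically.
        self._outcomes = deque(maxlen=size)

    def record(self, ok):
        self._outcomes.append(bool(ok))

    @property
    def count(self):
        return len(self._outcomes)

    def error_rate(self):
        if not self._outcomes:
            return 0.0               # no data yet; callers check count first
        errors = sum(1 for ok in self._outcomes if not ok)
        return errors / len(self._outcomes)


window = SlidingWindow(size=1000)
for i in range(200):
    window.record(ok=(i % 50 != 0))  # 4 failures in 200 requests
print(window.count, window.error_rate())
```

With 4 failures in 200 requests the window reports a 2% error rate, which is above the 1.5% SLO, so a watchdog tick at this point would drain v2.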

Signals to watch

Signal	Meaning	Action
v2 error rate vs v1	Is the canary worse than the baseline it replaces?	Compare, don't judge v2 in isolation
Latency p95 delta	v2 may answer but answer slowly	Roll back on latency regression too
Sample count	How much traffic v2 has actually seen	Wait for `min_samples` before deciding
SLO threshold	The error-rate ceiling that defines "breached"	Set it from the v1 baseline, not a guess
Rollback mode	Did a human or the watchdog act?	Prefer automatic; humans sleep
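
To make the first three rows concrete, here is a hedged sketch of a verdict function that compares v2 against v1's live numbers and also checks the p95 latency delta. The function names, the dict layout, and every threshold (`max_err_ratio`, `err_floor`, `max_p95_ratio`) are assumptions chosen for illustration, not values from this lesson. It also assumes v1 has already served at least one request.

```python
from statistics import quantiles


def p95(latencies_ms):
    """95th percentile; statistics.quantiles needs at least two data points."""
    return quantiles(latencies_ms, n=100)[94]


def canary_verdict(v1, v2, min_samples=200,
                   max_err_ratio=2.0, err_floor=0.001, max_p95_ratio=1.25):
    """Compare v2 to v1's live numbers instead of judging v2 in isolation.

    v1 and v2 are dicts with 'errors', 'total', and 'latencies_ms'.
    All thresholds are illustrative assumptions, not recommendations.
    """
    if v2["total"] < min_samples:
        return "wait"                                  # sample count too low
    v1_rate = v1["errors"] / v1["total"]
    v2_rate = v2["errors"] / v2["total"]
    # The floor keeps a near-zero baseline from making every v2 error a breach.
    if v2_rate > max_err_ratio * max(v1_rate, err_floor):
        return "rollback: error rate"
    if p95(v2["latencies_ms"]) > max_p95_ratio * p95(v1["latencies_ms"]):
        return "rollback: latency"
    return "healthy"


baseline = {"errors": 30, "total": 10_000, "latencies_ms": [100] * 1_000}
canary = {"errors": 11, "total": 500, "latencies_ms": [100] * 500}
print(canary_verdict(baseline, canary))
```

Here the baseline runs at 0.3% errors and the canary at 2.2%, so the error-rate check trips before latency is even considered.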

Watch out for

Canary too small to detect anything — 0.5% of traffic may never gather enough samples to trip the watchdog in time.
No automatic rollback, so the safety net is a human who happens to be awake and looking at the dashboard.
Comparing v2's errors to an absolute number instead of v1's live baseline: on a noisy day when v1's errors are elevated too, v2 looks like a regression even though it is no worse than the version it replaces.
The bad version is cached downstream, so draining traffic doesn't fully clear it — clear caches as part of rollback.
Rolling forward (a quick patch) under pressure instead of rolling back to the known-good version first.
Not baking or warming v2, so cold-start latency and empty caches confound the signal and look like a defect.
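
The first pitfall is easy to put numbers on. The sketch below estimates how long a canary needs before it has collected `min_samples` requests. The function name and the 50 requests-per-second total are hypothetical; only `min_samples = 200` and the 5% and 0.5% weights come from the text above.

```python
def seconds_to_min_samples(total_rps, canary_weight, min_samples=200):
    """Expected time for the canary to receive min_samples requests."""
    canary_rps = total_rps * canary_weight
    if canary_rps <= 0:
        return float("inf")          # a zero-weight canary never gets there
    return min_samples / canary_rps


for weight in (0.05, 0.005):
    secs = seconds_to_min_samples(total_rps=50, canary_weight=weight)
    print(f"{weight:.1%} canary: about {secs:.0f} s before the watchdog can judge")
```

At this assumed traffic level the 5% canary is judgeable after roughly 80 seconds, while the 0.5% canary needs roughly 800 seconds, during which a bad v2 keeps serving requests unchecked.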

Worked example

A team shipped v2 with a tokenizer change. On its 5% canary slice, the bad-prediction rate climbed from 0.3% to 2.1% within four minutes. The watchdog tripped at the 1.5% SLO, drained v2 to zero, and restored 100% of traffic to v1.

Because only the canary slice ever touched v2, the blast radius stayed at 5% of traffic for four minutes — not a full-fleet incident. The fix shipped the next day behind a fresh canary.
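
The same numbers show how much the canary limited the damage. As a rough sketch (the function name is made up for illustration, and it counts every bad v2 response without subtracting what v1 would have produced anyway), the share of all requests that got a bad answer from v2 at the peak is the canary weight times v2's bad-prediction rate:

```python
def bad_share_of_all_traffic(canary_weight, v2_bad_rate):
    """Fraction of ALL requests that hit v2 and got a bad response."""
    return canary_weight * v2_bad_rate


canary = bad_share_of_all_traffic(canary_weight=0.05, v2_bad_rate=0.021)
full = bad_share_of_all_traffic(canary_weight=1.0, v2_bad_rate=0.021)
print(f"canary: {canary:.3%} of all requests; full rollout: {full:.1%}")
```

At the 2.1% peak, about 0.1% of all requests were affected, compared with 2.1% had v2 taken the whole fleet.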

Check yourself

v2's error rate just crossed the SLO threshold on its canary slice. What should the system do?

Why compare v2's error rate to v1's live baseline rather than to a fixed number?