Send the new version a thin slice of traffic first, watch it like a hawk, and let the watchdog pull it back the moment it misbehaves.
When you ship a new model version, you don't flip everyone over at once. You stand v2 up beside the stable v1 and route just a small slice of requests to it — a canary. Most traffic keeps flowing to the version you already trust.
A watchdog measures v2's error rate on its slice. If v2 has a defect, that rate climbs. The moment it crosses your SLO threshold, the system drains v2 to zero and restores 100% of traffic to v1 — automatically, before the regression reaches everyone.
The whole point: keep the blast radius tiny while you find out whether the new version is actually good.
v2 is taking 5% of traffic. The watchdog is sampling its error rate.
The router draws a random number for each request and sends the request to v2 only when the draw falls below the canary weight. The watchdog records v2's outcomes over a sliding window. Once the window holds enough samples to trust the number, it compares v2's error rate to the SLO. A breach sets the canary weight to 0; that single change drains v2 and sends all traffic back to v1.
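```python
from random import random

canary_weight = 0.05  # 5% to v2, rest to v1
slo = 0.015           # 1.5% error-rate ceiling for v2
min_samples = 200     # don't judge on too few requests

# serve(), window, and promote_rollback() are assumed to be defined elsewhere.

def route(req):
    # One draw per request; only draws below the canary weight go to v2.
    if random() < canary_weight:
        version = "v2"
    else:
        version = "v1"
    resp = serve(version, req)
    if version == "v2":
        window.record(ok=resp.ok)  # only v2 outcomes feed the watchdog
    return resp

def watchdog_tick():
    global canary_weight
    if canary_weight == 0:
        return  # already rolled back; don't fire the rollback again
    if window.count < min_samples:
        return  # not enough signal yet
    v2_err = window.error_rate()
    if v2_err > slo:
        canary_weight = 0  # drain v2 -> 100% back on v1
        promote_rollback("v2 breached SLO")
```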
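The `window` object in the code above is never defined. Here is a minimal sketch of one way to build it, assuming a count-based window (the last N v2 outcomes) rather than a time-based one. The class name `SlidingWindow` and the `size` parameter are hypothetical; they are chosen only to match the `record`, `count`, and `error_rate` calls the watchdog makes.

```python
from collections import deque


class SlidingWindow:
    """Keeps the most recent `size` request outcomes (hypothetical helper)."""

    def __init__(self, size=1000):
        # A deque with maxlen drops the oldest outcome once the window is full.
        self._outcomes = deque(maxlen=size)

    def record(self, ok):
        self._outcomes.append(bool(ok))

    @property
    def count(self):
        return len(self._outcomes)

    def error_rate(self):
        if not self._outcomes:
            return 0.0
        errors = sum(1 for ok in self._outcomes if not ok)
        return errors / len(self._outcomes)


window = SlidingWindow(size=1000)
for i in range(200):
    window.record(ok=(i % 50 != 0))  # 4 failures in 200 requests
print(window.count, window.error_rate())  # prints: 200 0.02
```

A count-based window is the simplest choice, but at low canary traffic it can hold stale outcomes for a long time; a time-based window would trade that for a noisier rate.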
| Signal | Meaning | Action |
|---|---|---|
| v2 error rate vs v1 | Is the canary worse than the baseline it replaces? | Compare, don't judge v2 in isolation |
| Latency p95 delta | v2 may answer but answer slowly | Roll back on latency regression too |
| Sample count | How much traffic v2 has actually seen | Wait for min_samples before deciding |
| SLO threshold | The error-rate ceiling that defines "breached" | Set it from the v1 baseline, not a guess |
| Rollback mode | Did a human or the watchdog act? | Prefer automatic; humans sleep |
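
The signals in the table can be folded into a single rollback decision. The sketch below is one possible shape, assuming both versions report request counts, error counts, and p95 latency over the same window. `VersionStats`, `should_roll_back`, and the default thresholds are hypothetical; the 1.2-point error margin is simply the gap between the 0.3% baseline and the 1.5% ceiling quoted in this section, and the 50 ms latency margin is an arbitrary placeholder.

```python
from dataclasses import dataclass


@dataclass
class VersionStats:
    requests: int   # requests observed in the current window
    errors: int     # failed requests in the same window
    p95_ms: float   # 95th-percentile latency in milliseconds

    @property
    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0


def should_roll_back(v1, v2, min_samples=200,
                     max_err_delta=0.012, max_p95_delta_ms=50.0):
    """Return (decision, reason). Thresholds are illustrative only."""
    if v2.requests < min_samples:
        return False, "waiting for samples"
    # Judge v2 against the live v1 baseline, not in isolation.
    if v2.error_rate - v1.error_rate > max_err_delta:
        return True, "error rate regressed against v1"
    if v2.p95_ms - v1.p95_ms > max_p95_delta_ms:
        return True, "p95 latency regressed against v1"
    return False, "healthy"


# Made-up numbers for illustration.
v1 = VersionStats(requests=9500, errors=29, p95_ms=120.0)
v2 = VersionStats(requests=500, errors=11, p95_ms=125.0)
print(should_roll_back(v1, v2))
```

Comparing deltas rather than absolute values means a fleet-wide problem that hurts both versions equally does not get blamed on the canary.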
A team shipped v2 with a tokenizer change. On its 5% canary slice, the bad-prediction rate climbed from 0.3% to 2.1% within four minutes. The watchdog tripped at the 1.5% SLO, drained v2 to zero, and restored 100% of traffic to v1.
Because only the canary slice ever touched v2, the blast radius stayed at 5% of traffic for four minutes — not a full-fleet incident. The fix shipped the next day behind a fresh canary.
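The blast-radius claim can be checked with a little arithmetic. The sketch below compares the excess bad predictions on a 5% canary with what a full rollout would have produced over the same four minutes. The incident's request rate is not given, so `RPM` is a hypothetical input. The calculation also treats v2 as running at its peak 2.1% rate for the whole period, which makes both figures upper bounds.

```python
def excess_bad_predictions(requests_per_minute, minutes, traffic_share,
                           bad_rate, baseline_rate):
    """Bad predictions beyond what the baseline would have produced."""
    exposed = requests_per_minute * minutes * traffic_share
    return exposed * (bad_rate - baseline_rate)


RPM = 10_000  # hypothetical fleet-wide request rate; not from the incident

canary = excess_bad_predictions(RPM, 4, 0.05, 0.021, 0.003)
full_fleet = excess_bad_predictions(RPM, 4, 1.00, 0.021, 0.003)

print(round(canary), round(full_fleet))  # prints: 36 720
```

Whatever request rate is plugged in, the full-fleet figure is 20 times the canary figure, because the only term that differs is the traffic share.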
v2's error rate just crossed the SLO threshold on its canary slice. What should the system do?
Why compare v2's error rate to v1's live baseline rather than to a fixed number?