Mirroring live requests to a new model is free safety — until the mirror eats the capacity the real traffic needs.
Shadow traffic (or "dark launch") sends a copy of real production requests to a candidate model so you can compare its predictions with the primary's without exposing users to them. The user only ever sees the primary model's answer; the shadow's output is logged for comparison and never returned to the user.
The trap is capacity. A shadowed request now costs two inferences instead of one. If the shadow runs on the same fleet as the primary, every mirrored request adds load there, and at a 100% mirror you quietly double it and can push the primary path into timeouts. Shadowing is meant to be invisible, and overload is how it stops being invisible.
The router (a proxy or sidecar) forwards each request to the primary and, for a sampled fraction, sends a fire-and-forget copy to the shadow. Critically, the response to the user must never wait on the shadow: the mirror is asynchronous and best-effort.

    def handle(request):
        # Primary path: this is what the user sees. Always runs.
        response = primary.infer(request)
        # Shadow path: sampled, async, best-effort. Failures are swallowed.
        if random() < mirror_fraction and shadow_pool.has_headroom():
            spawn(lambda: log_compare(shadow.infer(copy(request))))
        return response  # returned immediately, never blocks on shadow

Two guards keep the mirror from harming the primary: mirror_fraction caps how much traffic is duplicated, and has_headroom() sheds shadow load the moment the shadow pool is saturated. The shadow should also run on its own replicas (or its own capacity budget) so a slow candidate model can't steal threads from the live path.
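The two guards can be sketched in runnable Python. This is a minimal illustration under stated assumptions, not a production router: the class name ShadowMirror, the max_in_flight parameter, the injectable rng argument and the shed counter are all hypothetical, and a non-blocking semaphore stands in for has_headroom(). It also assumes primary and shadow are plain callables. The worker pool only keeps shadow work off the request thread; real isolation still needs separate replicas, as described above.

```python
import copy
import logging
import random
import threading
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger("shadow")


class ShadowMirror:
    """Sampled, bounded, fire-and-forget mirror (illustrative sketch)."""

    def __init__(self, primary, shadow, mirror_fraction=0.1,
                 max_in_flight=8, rng=random.random):
        self.primary = primary              # callable: request -> response
        self.shadow = shadow                # callable: request -> response
        self.mirror_fraction = mirror_fraction
        self.rng = rng                      # injectable so sampling can be tested
        # Dedicated workers so shadow calls never run on the request thread.
        self.pool = ThreadPoolExecutor(max_workers=max_in_flight)
        # Free shadow slots; a failed acquire means "no headroom".
        self.slots = threading.BoundedSemaphore(max_in_flight)
        self.shed = 0                       # mirrors dropped for lack of headroom
        self._shed_lock = threading.Lock()

    def handle(self, request):
        response = self.primary(request)    # the only result the user sees
        if self.rng() < self.mirror_fraction:
            if self.slots.acquire(blocking=False):
                # Copy the request so the shadow cannot mutate live state.
                self.pool.submit(self._run_shadow,
                                 copy.deepcopy(request), response)
            else:
                with self._shed_lock:
                    self.shed += 1          # shadow sheds first under pressure
        return response                     # never waits on the shadow

    def _run_shadow(self, request, primary_response):
        try:
            shadow_response = self.shadow(request)
            log.info("match=%s", shadow_response == primary_response)
        except Exception:
            # Best-effort: a failing candidate must not affect the primary.
            log.exception("shadow inference failed")
        finally:
            self.slots.release()
```

With mirror_fraction=1.0 and max_in_flight=2, a slow shadow occupies both slots and every further mirror is counted in shed, while handle() keeps returning the primary's answer without delay.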
| Lever | Effect | Watch out for |
|---|---|---|
| Mirror % up | More comparison data, faster signal | Extra load scales linearly |
| Shared fleet | Cheap, no extra hardware | Shadow contends with primary |
| Isolated fleet | Primary protected | Costs real capacity to run |
| Sync mirror | Simpler code | Couples user latency to candidate |
A ranking service runs 10 replicas at 65% CPU during peak. A team dark-launches a new model at 100% mirror on the same replicas. Each request now triggers two inferences, so if the candidate costs about as much per inference as the primary, demand rises to roughly 130% of capacity, well past saturation. Tail latency spikes, the load balancer marks replicas unhealthy, and on-call sees a primary outage with "no code change." The fix: drop the mirror to 10%, move the shadow to its own replica set, and add a has_headroom() gate so the shadow sheds first under pressure.
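The arithmetic in this example can be checked with a small sketch. It assumes a simple linear model: utilisation scales with the number of inferences, and shadow_cost_ratio (a hypothetical parameter, default 1.0) is the cost of one candidate inference relative to one primary inference. The function names and the 80% ceiling in the usage lines are illustrative assumptions, not part of the scenario.

```python
def shared_fleet_utilisation(base_util, mirror_fraction, shadow_cost_ratio=1.0):
    """Estimated utilisation when the shadow runs on the primary's replicas.

    base_util: primary-only utilisation as a fraction (0.65 for 65%).
    mirror_fraction: share of requests copied to the shadow (0.0 to 1.0).
    shadow_cost_ratio: assumed cost of a shadow inference relative to a
        primary inference (1.0 means equally expensive).
    """
    if not 0.0 <= mirror_fraction <= 1.0:
        raise ValueError("mirror_fraction must be between 0 and 1")
    return base_util * (1.0 + mirror_fraction * shadow_cost_ratio)


def max_mirror_fraction(base_util, ceiling, shadow_cost_ratio=1.0):
    """Largest mirror fraction that keeps utilisation at or below ceiling."""
    if base_util <= 0 or shadow_cost_ratio <= 0:
        raise ValueError("base_util and shadow_cost_ratio must be positive")
    fraction = (ceiling / base_util - 1.0) / shadow_cost_ratio
    return max(0.0, min(1.0, fraction))  # clamp to a valid fraction


if __name__ == "__main__":
    print(f"{shared_fleet_utilisation(0.65, 1.0):.1%}")  # 130.0%
    print(f"{shared_fleet_utilisation(0.65, 0.1):.1%}")  # 71.5%
    print(f"{max_mirror_fraction(0.65, 0.80):.1%}")      # 23.1%
```

Under this model a 10% mirror left on the shared fleet would still add about 6.5 points of utilisation, which is why the fix also moves the shadow to its own replicas.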
Your primary fleet is at 70% utilisation. You want to shadow a new model. Which mirror setting is safest on the same fleet?