Shadow traffic overload

Mirroring live requests to a new model is free safety — until the mirror eats the capacity the real traffic needs.

The idea

Shadow traffic (or "dark launch") sends a copy of real production requests to a candidate model so you can compare its predictions without exposing users to them. The user only ever sees the primary model's answer; the shadow's output is logged and discarded.

The trap is capacity. A shadowed request now does two inferences instead of one. If the shadow runs on the same fleet and you mirror 100% of traffic, you quietly double the load and can push the primary path into timeouts; even a partial mirror on a shared fleet eats headroom in proportion to the mirrored fraction. Shadowing is meant to be invisible, and overload is how it stops being invisible.
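The capacity arithmetic can be sketched in a few lines. This is a rough model, not a measurement: it assumes load grows linearly with request rate, and the function name and the cost_ratio parameter (shadow inference cost relative to the primary) are illustrative assumptions, not something taken from a real system.

```python
def shared_fleet_utilisation(primary_util, mirror_fraction, cost_ratio=1.0):
    """Estimate utilisation of one fleet serving primary and mirrored requests.

    primary_util    -- utilisation before mirroring, as a fraction (0.65 = 65%)
    mirror_fraction -- share of requests copied to the shadow, 0.0 to 1.0
    cost_ratio      -- assumed cost of a shadow inference relative to a primary one
    """
    if not 0.0 <= mirror_fraction <= 1.0:
        raise ValueError("mirror_fraction must be between 0 and 1")
    if cost_ratio < 0.0:
        raise ValueError("cost_ratio must be non-negative")
    # Every mirrored request adds cost_ratio units of work on top of the primary's one.
    return primary_util * (1.0 + mirror_fraction * cost_ratio)


if __name__ == "__main__":
    for fraction in (0.0, 0.1, 0.5, 1.0):
        util = shared_fleet_utilisation(0.65, fraction)
        print(f"mirror {fraction:.0%}: estimated utilisation {util:.1%}")
```

Under these assumptions a result above 1.0 means the fleet is being offered more work than it can do, which in practice shows up as queueing and timeouts rather than as a utilisation number.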

How it works

The router (a proxy or sidecar) forwards each request to the primary and, for a sampled fraction, sends a fire-and-forget copy to the shadow. Critically, the response to the user must never wait on the shadow: the mirror is asynchronous and best-effort.

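import copy
import random

def handle(request):
    # Primary path: this is what the user sees. Always runs.
    response = primary.infer(request)
    # Shadow path: sampled, async, best-effort.
    if random.random() < mirror_fraction and shadow_pool.has_headroom():
        snapshot = copy.deepcopy(request)   # copy now, before anyone mutates the request
        spawn(lambda: mirror(snapshot))

    return response          # returned immediately, never blocks on shadow

def mirror(request):
    try:
        log_compare(shadow.infer(request))
    except Exception:
        pass                 # shadow failures are swallowed, never surfaced to the user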

Two guards keep the mirror from harming the primary: mirror_fraction caps how much traffic is duplicated, and has_headroom() sheds shadow load the moment the shadow pool is saturated. The shadow should also run on its own replicas (or its own capacity budget) so a slow candidate model can't steal threads from the live path.
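The two guards can be sketched as runnable code. This is a minimal illustration under stated assumptions, not the router's actual implementation: ShadowPool, try_submit, max_in_flight, and on_result are hypothetical names, and an in-process thread pool stands in for a separate shadow replica set (on its own it gives a capacity budget, not real isolation). The sketch also folds the has_headroom() check and the hand-off into one atomic try_submit, so two requests cannot both pass the check and overshoot the budget.

```python
import copy
import random
import threading
from concurrent.futures import ThreadPoolExecutor


class ShadowPool:
    """Bounded, best-effort runner for shadow inferences (illustrative only)."""

    def __init__(self, infer, max_in_flight=4):
        self._infer = infer              # candidate model's inference function
        self._max = max_in_flight        # the shadow's capacity budget
        self._in_flight = 0
        self._lock = threading.Lock()
        self._executor = ThreadPoolExecutor(max_workers=max_in_flight)
        self.shed = 0                    # mirrors dropped for lack of headroom

    def try_submit(self, request, on_result):
        """Reserve a slot and start the shadow call; return False if shed."""
        with self._lock:
            if self._in_flight >= self._max:
                self.shed += 1           # saturated: shed the mirror, not the user
                return False
            self._in_flight += 1
        self._executor.submit(self._run, request, on_result)
        return True

    def _run(self, request, on_result):
        try:
            on_result(self._infer(request))
        except Exception:
            pass                         # shadow failures never reach the user
        finally:
            with self._lock:
                self._in_flight -= 1

    def close(self):
        """Wait for in-flight shadow calls to finish and stop the workers."""
        self._executor.shutdown(wait=True)


def handle(request, primary_infer, pool, mirror_fraction, on_result):
    response = primary_infer(request)    # primary path, always runs
    if random.random() < mirror_fraction:
        # Copy before handing off so later mutation cannot affect the shadow.
        pool.try_submit(copy.deepcopy(request), on_result)
    return response                      # never waits on the shadow
```

With max_in_flight set to 1 and a slow candidate, the second mirrored request is counted in shed while both callers still get the primary's answer, which is the behaviour the guards are meant to guarantee.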

Signals & trade-offs

Lever          | Effect                              | Watch
Mirror % up    | More comparison data, faster signal | Extra load scales linearly
Shared fleet   | Cheap, no extra hardware            | Shadow contends with primary
Isolated fleet | Primary protected                   | Costs real capacity to run
Sync mirror    | Simpler code                        | Couples user latency to candidate

Watch out for

Worked example

A ranking service runs 10 replicas at 65% CPU during peak. A team dark-launches a new model at 100% mirror on the same replicas. Each request now does two inferences, so if the candidate costs about as much per inference as the primary, offered load jumps to ~130% of capacity, well past saturation. Tail latency spikes, the load balancer marks replicas unhealthy, and on-call sees a primary outage with "no code change." The fix: drop the mirror to 10%, move the shadow to its own replica set, and add a has_headroom() gate so the shadow sheds first under pressure.
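To size the fix rather than guess at it, the same numbers can be run the other way. The sketch below is a back-of-the-envelope aid under assumptions that are not in the example itself: load is linear in request rate, the candidate costs cost_ratio times the primary per inference, and the 80% utilisation ceiling and 50% shadow target used in the usage lines are arbitrary illustrative values. Both function names are hypothetical.

```python
import math


def max_mirror_fraction(primary_util, ceiling, cost_ratio=1.0):
    """Largest mirror fraction that keeps a shared fleet at or below `ceiling`."""
    if primary_util <= 0.0 or cost_ratio <= 0.0:
        raise ValueError("primary_util and cost_ratio must be positive")
    if primary_util >= ceiling:
        return 0.0                       # no headroom left for any mirroring
    # Solve primary_util * (1 + f * cost_ratio) <= ceiling for f.
    return min(1.0, (ceiling / primary_util - 1.0) / cost_ratio)


def shadow_replicas_needed(primary_replicas, primary_util, mirror_fraction,
                           target_util, cost_ratio=1.0):
    """Replicas an isolated shadow set needs to absorb the mirrored work."""
    if target_util <= 0.0:
        raise ValueError("target_util must be positive")
    # Work is measured in "fully busy replica" units.
    work = primary_replicas * primary_util * mirror_fraction * cost_ratio
    # The small epsilon keeps float rounding from adding a spurious replica.
    return math.ceil(work / target_util - 1e-9)


if __name__ == "__main__":
    # Shared fleet at 65%, assumed 80% ceiling: roughly a 23% mirror at most.
    print(round(max_mirror_fraction(0.65, 0.80), 2))
    # Isolated set for a 10% mirror, assumed 50% target utilisation: 2 replicas.
    print(shadow_replicas_needed(10, 0.65, 0.10, 0.50))
```

Read this way, the 10% mirror in the fix sits comfortably inside the shared-fleet budget, and the isolated replica set it moves to can be small; real sizing should still come from load tests against the candidate.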

Check yourself

Your primary fleet is at 70% utilisation. You want to shadow a new model. Which mirror setting is safest on the same fleet?