Shadow traffic overload

Mirroring live requests to a new model is free safety — until the mirror eats the capacity the real traffic needs.

The idea

Shadow traffic (or "dark launch") sends a copy of real production requests to a candidate model so you can compare its predictions without exposing users to them. The user only ever sees the primary model's answer; the shadow's output is logged and discarded.

The trap is capacity. A shadowed request now does two inferences instead of one. If the shadow runs on the same fleet, or you mirror 100% of traffic, you can quietly double the load and push the primary path into timeouts. Shadowing is meant to be invisible — overload is how it stops being invisible.

How it works

The router (a proxy or sidecar) forwards each request to the primary and, for a sampled fraction, sends a fire-and-forget copy to the shadow. Critically, the response to the user must never wait on the shadow: the mirror is asynchronous and best-effort.

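import copy
import random

def handle(request):
    # Primary path: this is what the user sees. Always runs.
    response = primary.infer(request)

    # Shadow path: sampled, async, best-effort.
    if random.random() < mirror_fraction and shadow_pool.has_headroom():
        snapshot = copy.deepcopy(request)  # copy now, before the caller can mutate it
        spawn(lambda: shadow_compare(snapshot))  # spawn: any non-blocking task launcher
    return response  # returned immediately, never blocks on the shadow

def shadow_compare(request):
    try:
        log_compare(shadow.infer(request))
    except Exception:
        pass  # shadow failures are swallowed, never surfaced to the user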

Two guards keep the mirror from harming the primary: mirror_fraction caps how much traffic is duplicated, and has_headroom() sheds shadow load the moment the shadow pool is saturated. The shadow should also run on its own replicas (or its own capacity budget) so a slow candidate model can't steal threads from the live path.
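One way to make the two guards concrete is sketched below. It is a minimal sketch rather than a reference implementation: the ShadowMirror class, its max_in_flight cap, and the counters are hypothetical names, and the headroom check is approximated by a non-blocking semaphore over in-flight shadow requests that run on a dedicated thread pool.

```python
import copy
import random
import threading
from concurrent.futures import ThreadPoolExecutor


class ShadowMirror:
    """Sampled, fire-and-forget mirroring with an in-flight cap (hypothetical helper)."""

    def __init__(self, shadow_infer, mirror_fraction=0.1, max_in_flight=4,
                 rng=random.random):
        self.shadow_infer = shadow_infer        # callable that runs the candidate model
        self.mirror_fraction = mirror_fraction  # guard 1: share of traffic to duplicate
        self.rng = rng
        # Guard 2: one slot per in-flight shadow request; no free slot means no headroom.
        self._slots = threading.BoundedSemaphore(max_in_flight)
        # Dedicated workers so shadow inference never borrows primary threads.
        self._executor = ThreadPoolExecutor(max_workers=max_in_flight)
        self._lock = threading.Lock()
        self.sent = self.shed = self.errors = 0

    def maybe_mirror(self, request):
        """Dispatch a shadow copy if sampled and there is headroom; never blocks."""
        if self.rng() >= self.mirror_fraction:
            return False
        if not self._slots.acquire(blocking=False):
            with self._lock:
                self.shed += 1                  # saturated: shed shadow load first
            return False
        with self._lock:
            self.sent += 1
        # Copy before handing off so later mutation by the caller cannot leak in.
        self._executor.submit(self._run, copy.deepcopy(request))
        return True

    def _run(self, request):
        try:
            self.shadow_infer(request)          # output would be logged and compared here
        except Exception:
            with self._lock:
                self.errors += 1                # swallowed: the user never sees shadow failures
        finally:
            self._slots.release()

    def close(self):
        self._executor.shutdown(wait=True)
```

A handler would call maybe_mirror(request) after computing the primary response and ignore the return value. The shed counter is worth exporting as a metric: a rising value means the shadow pool is saturated and the comparison sample is thinner than mirror_fraction suggests.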

Signals & trade-offs

Lever	Effect	Watch
Mirror % up	More comparison data, faster signal	Extra load scales linearly
Shared fleet	Cheap, no extra hardware	Shadow contends with primary
Isolated fleet	Primary protected	Costs real capacity to run
Sync mirror	Simpler code	Couples user latency to candidate

Watch out for

Mirroring 100% onto the same fleet. If the candidate costs about as much per request as the primary, you just doubled inference load: a fleet at 60% utilisation is now asked for 120%, and the primary path starts timing out.
Blocking on the shadow. If the response waits for both, a slow candidate model adds its latency to every user request. The mirror must be fire-and-forget.
Side effects in the shadow path. A mirrored request that writes to a database, charges a card, or sends a notification causes real, duplicated actions. Keep the shadow path read-only.
No headroom check. Without backpressure, a shadow GPU OOM or queue buildup feeds back into shared resources and degrades the primary.
Forgetting to turn it off. A dark launch left running for weeks is a permanent capacity tax whose logs nobody is reading.
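
The last pitfall can be guarded against mechanically by giving the mirror an expiry. The sketch below assumes a hypothetical ShadowConfig object that the router consults on each request; the field names are illustrative and not taken from any particular proxy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ShadowConfig:
    mirror_fraction: float   # requested share of traffic to mirror, 0.0 to 1.0
    expires_at: datetime     # hard stop; extending it has to be a deliberate change

    def effective_fraction(self, now: datetime) -> float:
        """Fraction the router should actually use at time `now`."""
        if now >= self.expires_at:
            return 0.0       # experiment expired: mirroring is off by default
        return min(max(self.mirror_fraction, 0.0), 1.0)  # clamp bad config values


start = datetime(2024, 1, 1, tzinfo=timezone.utc)  # arbitrary example date
config = ShadowConfig(mirror_fraction=0.10, expires_at=start + timedelta(days=7))

print(config.effective_fraction(start + timedelta(days=1)))   # 0.1
print(config.effective_fraction(start + timedelta(days=30)))  # 0.0
```

With an expiry in place, a forgotten dark launch falls back to zero mirroring on its own, and keeping it alive requires someone to renew it on purpose.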

Worked example

A ranking service runs 10 replicas at 65% CPU during peak. A team dark-launches a new model at 100% mirror on the same replicas. Each request now does two inferences, so effective load jumps to ~130% — past saturation. Tail latency spikes, the load balancer marks replicas unhealthy, and on-call sees a primary outage with "no code change." The fix: drop the mirror to 10%, move the shadow to its own replica set, and add a has_headroom() gate so the shadow sheds first under pressure.
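
The arithmetic behind this example can be checked with a few lines. This is a back-of-the-envelope sketch: the function names are hypothetical, cost_ratio (shadow inference cost relative to the primary) is an assumption set to 1.0 to match the example, and load is assumed to scale linearly with mirrored traffic. The 80% target in the last call is an illustrative threshold, not a figure from the example.

```python
def shared_fleet_utilisation(primary_util, mirror_fraction, cost_ratio=1.0):
    """Utilisation when the shadow runs on the primary's replicas."""
    return primary_util * (1.0 + mirror_fraction * cost_ratio)


def max_safe_mirror_fraction(primary_util, target_util, cost_ratio=1.0):
    """Largest mirror fraction that keeps a shared fleet at or below target_util."""
    if primary_util <= 0 or cost_ratio <= 0:
        raise ValueError("primary_util and cost_ratio must be positive")
    headroom = target_util - primary_util
    if headroom <= 0:
        return 0.0
    return min(1.0, headroom / (primary_util * cost_ratio))


print(f"{shared_fleet_utilisation(0.65, 1.00):.3f}")  # 1.300: past saturation
print(f"{shared_fleet_utilisation(0.65, 0.10):.3f}")  # 0.715
print(f"{max_safe_mirror_fraction(0.65, 0.80):.3f}")  # 0.231
```

Even at a 10% mirror the shared fleet moves from 65% to about 71.5%, which is why the fix also moves the shadow onto its own replica set instead of relying on the lower fraction alone.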

Check yourself

Your primary fleet is at 70% utilisation. You want to shadow a new model. Which mirror setting is safest on the same fleet?