Traffic spike

When the flood rises faster than you can add boats, you turn passengers away at the dock so the boat you have stays afloat.

The idea

A traffic spike is a sudden surge of incoming load — a flash sale, a viral link, a retry storm — that arrives far quicker than your service can grow to meet it. Autoscaling helps, but new instances take time to boot, warm their caches, and join the pool, so for a window the spike outruns your capacity.

The on-call move is two-handed: scale up to add capacity for the long run, and shed or rate-limit the excess right now so the work you do accept stays fast and correct. A service that absorbs a spike this way degrades gracefully; one that lets the spike in unbounded cascades: queues back up, latency explodes, and the whole service collapses.

How it works

def on_tick(metrics):                  # the triage → contain loop
    # Python-style sketch; helpers and thresholds are illustrative names.
    # 1. DETECT: is this a spike?
    if (metrics.p99 > slo
            or metrics.queue_depth_rising
            or metrics.inbound_rps > spike_factor * baseline):  # far above baseline
        declare_spike()

    # 2. SCALE: ask for capacity (it arrives LATE)
    target = headroom_factor * metrics.inbound_rps
    autoscaler.desire(target)          # boot, warm, join pool: slow

    # 3. CONTAIN: the spike outruns scale-up, so cap intake NOW
    safe_rps = current_capacity * 0.85 # leave headroom; never run at 100%
    if metrics.inbound_rps > safe_rps:
        # admit what we can serve, shed the rest deterministically
        rate_limit(to=safe_rps)
        shed(priority=LOW_FIRST)       # 429 / queue overflow / fast-fail

    # 4. PROTECT THE CORE: keep critical paths fast
    isolate(critical_pool)             # bulkhead checkout from search
    if downstream_slow:
        open_circuit()                 # stop hammering, fail fast

    # capacity catches up → relax limits → scale back down when calm

The key insight: shedding is not giving up. By refusing excess at the edge, you keep the requests you do accept inside the latency budget, so they succeed instead of all timing out together.
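
The contain step can be sketched as a per-window admission function. This is a minimal illustration, not a production limiter: it assumes arrivals are grouped into one-second windows and that each request carries an integer priority where 0 is the most critical. The names `Request`, `admit`, and `utilisation_pct` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    id: int
    priority: int   # assumption: 0 = critical (checkout), larger = less critical


def admit(window, capacity_rps, utilisation_pct=85):
    """Split one second of arrivals into (admitted, shed)."""
    # Cap intake below full capacity, mirroring safe_rps in the loop above.
    budget = capacity_rps * utilisation_pct // 100
    # sorted() is stable, so requests of equal priority keep arrival order.
    ranked = sorted(window, key=lambda r: r.priority)
    return ranked[:budget], ranked[budget:]


# 320 arrivals in one second: every fourth is a checkout (priority 0),
# the rest are anonymous browse traffic (priority 2).
arrivals = [Request(i, 0 if i % 4 == 0 else 2) for i in range(320)]
admitted, shed = admit(arrivals, capacity_rps=100)
print(len(admitted), len(shed))   # 85 235
```

All 80 checkouts fit inside the budget of 85, so only browse traffic is shed. A real limiter would make this decision per request as it arrives (for example with a token bucket) rather than sorting a whole window.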

Signals

Symptom	What it tells you
p99 latency rising	Work is queuing — requests wait behind a backlog before they run. The earliest honest sign of saturation.
Queue depth growing	Arrival rate exceeds service rate. If it grows unbounded, latency heads to infinity and you are already overloaded.
Error rate climbing	Timeouts, dropped connections, or shed responses (`429`). Distinguish deliberate shedding from uncontrolled collapse.
CPU / connections saturated	You have hit a hard resource ceiling. More requests now only steal time from in-flight ones — the signal to shed, not push harder.
Retry rate spiking	Clients are amplifying the load. Each failure becomes two or three more requests — a self-reinforcing storm you must dampen.
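
The latency and queue-depth rows, together with the inbound rate, can be combined into a simple detect check. The sketch below is one possible reading of "rising" and "far above baseline"; the `Metrics` fields, the three-sample minimum, and the `surge_factor` of 3 are assumptions made for illustration, not recommended thresholds.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    p99_ms: float
    queue_depths: list   # recent queue-depth samples, oldest first
    inbound_rps: float


def is_spike(m, slo_ms, baseline_rps, surge_factor=3.0):
    """True if any of the three detect conditions holds."""
    latency_breach = m.p99_ms > slo_ms
    # "Rising" here means every sample is strictly larger than the one before.
    depths = m.queue_depths
    queue_rising = len(depths) >= 3 and all(a < b for a, b in zip(depths, depths[1:]))
    surge = m.inbound_rps > surge_factor * baseline_rps
    return latency_breach or queue_rising or surge


calm = Metrics(p99_ms=120, queue_depths=[4, 3, 5, 4], inbound_rps=40)
busy = Metrics(p99_ms=360, queue_depths=[5, 40, 180, 420], inbound_rps=320)
print(is_spike(calm, slo_ms=200, baseline_rps=40))   # False
print(is_spike(busy, slo_ms=200, baseline_rps=40))   # True
```

Any single condition is enough to trigger here, which favours early detection over avoiding false alarms; a noisier service might require two of the three.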

Watch out for

Scale-up is too slow. Booting and warming an instance takes tens of seconds to minutes; the spike lands in seconds. Autoscaling alone cannot win that race, so you need shedding to bridge the gap.
Retry storms amplify the spike. Aggressive client retries on failure turn one wave into three. Use exponential backoff with jitter and a retry budget, or you fuel the fire you are fighting.
No shedding, so everything fails together. Without an intake cap, every request slows past the timeout at once, so you get close to 100% failure instead of serving the load that fits within about 85% of capacity well.
Cold caches after scale-up. Fresh instances start with empty caches and miss to the database, so they are briefly slower than warm ones and can hammer downstreams. Pre-warm or ramp traffic in gradually.
Shedding the wrong traffic. Dropping checkouts to protect search is backwards. Shed by priority — low-value, retryable, or anonymous traffic first — and keep critical paths flowing.
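
The retry advice above (exponential backoff with jitter, plus a retry budget) can be sketched on the client side as follows. This is a minimal illustration under stated assumptions: `backoff_delay` and `RetryBudget` are hypothetical names, the "full jitter" variant is one common choice among several, and the 10% budget is an example value rather than a recommendation.

```python
import random


def backoff_delay(attempt, base=0.1, cap=10.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based), with full jitter."""
    ceiling = min(cap, base * 2 ** attempt)   # exponential growth, bounded by cap
    return rng.uniform(0.0, ceiling)          # random point in [0, ceiling]


class RetryBudget:
    """Permits retries only up to `percent` of the first attempts seen so far."""

    def __init__(self, percent=10):
        self.percent = percent
        self.first_attempts = 0
        self.retries = 0

    def record_first_attempt(self):
        self.first_attempts += 1

    def try_spend(self):
        # Integer arithmetic avoids floating-point edge cases at the boundary.
        if (self.retries + 1) * 100 > self.percent * self.first_attempts:
            return False          # budget exhausted: fail instead of retrying
        self.retries += 1
        return True


budget = RetryBudget(percent=10)
for _ in range(100):
    budget.record_first_attempt()

# Suppose 30 of those 100 requests fail and each asks to retry once.
granted = sum(budget.try_spend() for _ in range(30))
print(granted)   # 10: retries add at most 10% extra load
```

The jitter spreads retries out in time so they do not arrive as a synchronized second wave, and the budget bounds how much extra load retries can add no matter how many requests fail.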

Worked example

A flash sale opens at noon. At 11:59 the service hums at 40 rps on 2 instances rated for ~100 rps combined. At 12:00:00 a push notification lands and inbound jumps to 320 rps in under ten seconds.

By 12:00:05 p99 latency has tripled and the request queue is climbing, so the detection alarms fire. The autoscaler requests 6 more instances, but they will not be healthy and warm for ~90 seconds. That window is the danger: incoming is 320 rps, capacity is still ~100 rps, and the gap of 220 rps is piling into queues.

So on-call contains: a rate limiter caps intake at ~85 rps (85% of current capacity) and sheds the rest with a fast 429 plus a "high demand, retrying" page — anonymous browse traffic first, logged-in checkouts last. The traffic it accepts now stays under 200 ms instead of all timing out together. Around 12:01:30 the new instances come online, capacity steps to ~400 rps, the limiter relaxes, and the queue drains. The sale was absorbed: most customers waited a few seconds, none saw a crashed site. Once the surge settles, the extra instances scale back down.
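
The arithmetic behind the danger window can be checked with a deliberately crude model. The sketch assumes constant arrival and service rates, no client timeouts, and no retries; these are simplifications for illustration, and `backlog_after` is a hypothetical helper rather than part of any tool mentioned here.

```python
def backlog_after(seconds, inbound_rps, capacity_rps, limit_rps=None):
    """Queue depth after `seconds` in a simple fixed-rate model."""
    queue = 0
    for _ in range(seconds):
        # With a limiter, only limit_rps requests per second enter the queue.
        accepted = inbound_rps if limit_rps is None else min(inbound_rps, limit_rps)
        queue = max(0, queue + accepted - capacity_rps)
    return queue


unbounded = backlog_after(90, inbound_rps=320, capacity_rps=100)
limited = backlog_after(90, inbound_rps=320, capacity_rps=100, limit_rps=85)
print(unbounded)         # 19800 requests queued after the 90-second window
print(unbounded / 100)   # 198.0 seconds to drain that backlog at 100 rps
print(limited)           # 0
```

Under these assumptions, letting everything in leaves a backlog that would take over three minutes to drain at the original capacity, far beyond any reasonable request timeout, while the 85 rps cap keeps the queue empty for the whole window.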

Check yourself

During the danger window, the autoscaler has already requested more instances but they are not healthy yet. What should on-call do right now?

Clients are retrying failed requests aggressively, and inbound rps keeps climbing even as you add capacity. What is the most likely dynamic?