Reverse-proxy rate limiting

Stop the flood at the front door so your origin only ever sees a steady trickle.

The idea

A reverse proxy (or API gateway) sits in front of your real servers and receives every request first. Before it forwards anything to the origin, it can decide: is this caller asking too fast? If so, it answers right there with 429 Too Many Requests and the origin never even hears about it.

The classic way to make that decision is a token bucket. Each client (per IP or per API key) gets a bucket that holds up to CAP tokens and refills at a steady RATE. Every allowed request spends one token. A short burst can drain the bucket fast — but once it's empty, further requests get rejected until the bucket trickles back up.

How it works

The gateway keeps one bucket per key — usually per API key, or per client IP. On each request it first lazily refills the bucket based on how much time has passed (no background timer needed): tokens go up by elapsed × RATE, capped at CAP. Then if at least one token is available, it spends one and forwards the request to the origin; otherwise it short-circuits with 429.

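import time
from dataclasses import dataclass

CAP  = 5.0    # bucket capacity (max burst)
RATE = 1.0    # tokens added per second

@dataclass
class Bucket:
    tokens: float   # tokens currently available
    ts: float       # time of the last refill (monotonic seconds)

buckets = {}        # one bucket per key (API key or client IP)

def allow(key, now):
    # a key seen for the first time starts with a full bucket
    b = buckets.setdefault(key, Bucket(CAP, now))
    # lazy refill: credit tokens for elapsed time, capped at CAP
    b.tokens = min(CAP, b.tokens + (now - b.ts) * RATE)
    b.ts = now
    if b.tokens >= 1:
        b.tokens -= 1
        return True          # forward to origin
    return False             # 429 Too Many Requests

# at the edge, before proxying (Response and forward_to_origin are
# placeholders for whatever your gateway framework provides):
def handle(request, client_key):
    if not allow(client_key, time.monotonic()):
        return Response(status=429, headers={"Retry-After": "1"})
    return forward_to_origin(request)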

Enforcing this at the edge matters: a rejected request costs the gateway a tiny token check, while the origin — your database, your business logic — does zero work. The Retry-After header tells well-behaved clients exactly how long to back off.
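
The hard-coded "1" in the snippet above happens to match RATE = 1. A gateway could instead derive the value from the bucket's current state. The following is a minimal sketch under that assumption; retry_after_seconds is a hypothetical helper name, not part of any framework.

```python
import math

def retry_after_seconds(tokens: float, rate: float) -> int:
    """Whole seconds until the bucket holds at least one token.

    tokens: current (already refilled) token count, possibly fractional
    rate:   refill rate in tokens per second (must be > 0)
    """
    if tokens >= 1:
        return 0                      # a request would be allowed right now
    deficit = 1 - tokens              # fraction of a token still missing
    # as a delay, Retry-After is a whole number of seconds, so round up
    return math.ceil(deficit / rate)

print(retry_after_seconds(0.0, 1.0))   # empty bucket, 1 token/s
print(retry_after_seconds(0.5, 0.25))  # half a token short at 0.25 tokens/s
```

Rounding up errs on the side of telling the client to wait slightly too long rather than inviting a retry that would be rejected again.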

Trade-offs

| Approach             | Burst behavior                            | Note                                                             |
|----------------------|-------------------------------------------|------------------------------------------------------------------|
| Token bucket         | Allows bursts up to CAP, then steady RATE | Smooth and forgiving; one counter + timestamp per key            |
| Fixed window         | Up to 2× limit across a window boundary   | Cheapest (one counter), but boundary bursts slip through         |
| Sliding log          | Exact; no boundary spike                  | Most accurate, but stores a timestamp per request (memory heavy) |
| Per-edge local count | Fast, zero network hop                    | With N edges the real limit is N × the intended one              |
| Shared (Redis) count | Accurate across the whole cluster         | Adds a round-trip of latency to every request                    |
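
To make the fixed-window boundary burst concrete, here is a small sketch of a fixed-window counter. LIMIT, WINDOW, allow_fixed_window and the timestamps are illustrative assumptions, and the counter dictionary is never cleaned up, which a real gateway would need to handle.

```python
from collections import defaultdict

LIMIT = 5      # requests allowed per window (illustrative value)
WINDOW = 60    # window length in seconds (illustrative value)

counts = defaultdict(int)   # (key, window index) -> requests seen

def allow_fixed_window(key: str, now: float) -> bool:
    window = int(now // WINDOW)          # which window this instant falls in
    if counts[(key, window)] < LIMIT:
        counts[(key, window)] += 1
        return True
    return False

# 5 requests just before the boundary at t=60, then 5 just after it
late = [allow_fixed_window("client-a", 59.0 + i * 0.1) for i in range(5)]
early = [allow_fixed_window("client-a", 60.0 + i * 0.1) for i in range(5)]
print(sum(late) + sum(early))   # 10 admitted in under two seconds
```

Each window on its own respects the limit of 5, yet the client gets 10 requests through around the boundary. A token bucket with CAP = 5 would have admitted roughly 5 plus whatever refilled in that interval.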

Worked example

Capacity CAP = 5, refill RATE = 1 token/second. The bucket starts full. A client fires a burst of 8 requests within one second — too quick for any meaningful refill:

start:  tokens = 5
r1  spend -> tokens 4   allowed -> origin
r2  spend -> tokens 3   allowed -> origin
r3  spend -> tokens 2   allowed -> origin
r4  spend -> tokens 1   allowed -> origin
r5  spend -> tokens 0   allowed -> origin
r6  empty -> 429 Too Many Requests (Retry-After: 1)
r7  empty -> 429
r8  empty -> 429
--- 1 second later, lazy refill adds 1 token ---
        tokens = 1
r9  spend -> tokens 0   allowed -> origin

So of the 8-request burst, 5 are allowed and 3 get 429. The origin only ever saw 5 requests. One second later the bucket has refilled by exactly one token, so the next request is allowed again — a steady trickle, as promised.
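
The trace above can be replayed in a few lines. This is a sketch, not a production limiter: it uses a single bucket, simulated timestamps instead of a real clock, and treats the whole burst as arriving at t = 0 so that no refill happens in between (a simplifying assumption).

```python
CAP = 5.0     # bucket capacity (max burst)
RATE = 1.0    # tokens added per second

class Bucket:
    def __init__(self, now: float):
        self.tokens = CAP   # bucket starts full
        self.ts = now       # time of the last refill

    def allow(self, now: float) -> bool:
        # lazy refill for the elapsed time, capped at CAP
        self.tokens = min(CAP, self.tokens + (now - self.ts) * RATE)
        self.ts = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = Bucket(now=0.0)
# r1..r8 all arrive at t=0 (an idealised burst with no refill in between)
burst = [bucket.allow(0.0) for _ in range(8)]
# r9 arrives one second later
r9 = bucket.allow(1.0)

print(burst.count(True), burst.count(False), r9)   # 5 3 True
```

If the burst were spread across most of a second instead, the partial refill could let a sixth request through near the end, which is why the example keeps the burst tight.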

Check yourself

1. You run the same 100-requests-per-minute token-bucket limit independently on each of 4 gateway edges, with no shared store. What's the real limit a single client can hit?

2. With CAP = 5 and RATE = 1/sec, a client sends 8 requests in a single second. How many reach the origin?