Reverse-proxy rate limiting

Stop the flood at the front door so your origin only ever sees a steady trickle.

The idea

A reverse proxy (or API gateway) sits in front of your real servers and receives every request first. Before it forwards anything to the origin, it can decide: is this caller asking too fast? If so, it answers right there with 429 Too Many Requests and the origin never even hears about it.

The classic way to make that decision is a token bucket. Each client (per IP or per API key) gets a bucket that holds up to CAP tokens and refills at a steady RATE. Every allowed request spends one token. A short burst can drain the bucket fast — but once it's empty, further requests get rejected until the bucket trickles back up.

How it works

The gateway keeps one bucket per key — usually per API key, or per client IP. On each request it first lazily refills the bucket based on how much time has passed (no background timer needed): tokens go up by elapsed × RATE, capped at CAP. Then if at least one token is available, it spends one and forwards the request to the origin; otherwise it short-circuits with 429.

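import time
from dataclasses import dataclass

CAP  = 5      # bucket capacity (max burst)
RATE = 1      # tokens added per second

@dataclass
class Bucket:
    tokens: float   # tokens currently available (may be fractional)
    ts: float       # time of the last refill, in seconds

bucket = {}   # key -> Bucket, created the first time a key is seen

def allow(key, now):
    # a key seen for the first time starts with a full bucket
    b = bucket.setdefault(key, Bucket(tokens=CAP, ts=now))
    # lazy refill: credit tokens for elapsed time, capped at CAP
    b.tokens = min(CAP, b.tokens + (now - b.ts) * RATE)
    b.ts = now
    if b.tokens >= 1:
        b.tokens -= 1
        return True          # forward to origin
    return False             # 429 Too Many Requests

# at the edge, before proxying; handle, Response and forward_to_origin
# are placeholders for whatever your proxy framework provides
def handle(request, client_key):
    if not allow(client_key, time.monotonic()):
        return Response(status=429, headers={"Retry-After": "1"})
    return forward_to_origin(request)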

Enforcing this at the edge matters: a rejected request costs the gateway a tiny token check, while the origin — your database, your business logic — does zero work. The Retry-After header tells well-behaved clients exactly how long to back off.
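
A fixed Retry-After of 1 happens to match RATE = 1, but the wait can also be derived from how far the bucket is from holding one whole token. The following is a minimal sketch under that assumption; retry_after_seconds is a hypothetical helper, not part of any proxy or framework, and it rounds up because Retry-After is usually sent as a whole number of seconds.

```python
import math

RATE = 1.0  # tokens added per second (same meaning as RATE above)

def retry_after_seconds(tokens: float, rate: float = RATE) -> int:
    """Whole seconds until the bucket holds at least one token.

    Hypothetical helper: `tokens` is the bucket's current (possibly
    fractional) balance after the lazy refill.
    """
    if tokens >= 1:
        return 0
    # Time needed to earn the missing fraction of a token, rounded up
    # to a whole number of seconds.
    return math.ceil((1 - tokens) / rate)

print(retry_after_seconds(0.0))            # empty bucket, 1 token/s
print(retry_after_seconds(0.75))           # a quarter of a token short
print(retry_after_seconds(0.0, rate=0.5))  # slower refill, longer wait
```

With a slow refill rate the computed value grows accordingly, which is the case where a hardcoded "1" would send clients back too early.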

Trade-offs

Approach	Burst behavior	Note
Token bucket	Allows bursts up to `CAP`, then steady `RATE`	Smooth and forgiving; one counter + timestamp per key
Fixed window	Up to `2×` limit across a window boundary	Cheapest (one counter), but boundary bursts slip through
Sliding log	Exact — no boundary spike	Most accurate, but stores a timestamp per request (memory heavy)
Per-edge local count	Fast, zero network hop	With `N` edges the real limit can reach `N×` the intended one
Shared (Redis) count	Accurate across the whole cluster	Adds a round-trip of latency to every request
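
To make the fixed-window row concrete, here is a small sketch of a fixed-window counter hit on both sides of a window boundary. LIMIT, WINDOW, counts and allow_fixed_window are illustrative names and values chosen for this example, and the sketch never expires old windows, which a real limiter would need to do.

```python
from collections import defaultdict

LIMIT = 5      # allowed requests per window (illustrative value)
WINDOW = 60    # window length in seconds (illustrative value)

counts = defaultdict(int)   # (key, window index) -> requests counted

def allow_fixed_window(key, now):
    """Fixed-window check: one counter per key per window."""
    window_index = int(now // WINDOW)
    if counts[(key, window_index)] < LIMIT:
        counts[(key, window_index)] += 1
        return True
    return False

# 6 requests just before the boundary at t=60, then 6 right after it.
before = [allow_fixed_window("client", 59.0) for _ in range(6)]
after = [allow_fixed_window("client", 60.0) for _ in range(6)]
print(sum(before), sum(after))   # requests allowed on each side
```

Each side of the boundary admits a full LIMIT, so the client gets 2 × LIMIT requests through within about one second even though neither window was exceeded.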

Watch out for

Per-edge buckets undercount in a cluster. Each gateway sees only its own share of a client's traffic, so if 4 gateways each keep a local bucket of 100/min, a client whose requests are spread across them can reach up to 400/min. Use a shared store (Redis), or divide the limit by the edge count if the load balancer spreads each client's traffic roughly evenly.
Fixed-window boundary bursts. A client can send a full window's worth at 11:59:59 and another full window's worth at 12:00:00: twice the per-window limit in about two seconds. A sliding window avoids this, and a token bucket caps any burst at CAP.
Limiting by spoofable client IP. One IP can be shared by many users (NAT, mobile carriers), and the X-Forwarded-For header can be forged by the client. Prefer authenticated API keys, and only trust forwarded IPs from your own proxies.
No Retry-After header. Without it, clients retry blindly and hammer you harder. Always tell them when to come back.
Forgetting rejected requests still cost something. TLS, parsing, and the limiter check aren't free — a flood of 429s can still strain the edge. Keep the reject path cheap and consider connection-level limits too.
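
The advice about API keys and forwarded IPs can be sketched as a key-selection function. This is one possible approach under stated assumptions: TRUSTED_PROXIES, the 10.0.0.0/8 range, the X-API-Key header name, and the client_key helper are all hypothetical, the API key is assumed to have been authenticated already, and headers is assumed to be a plain dict with exactly these header names.

```python
import ipaddress

# Assumption: the networks your own proxies / load balancers connect from.
TRUSTED_PROXIES = [ipaddress.ip_network("10.0.0.0/8")]

def is_trusted(ip: str) -> bool:
    """True if `ip` parses and belongs to one of our own proxy networks."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False
    return any(addr in net for net in TRUSTED_PROXIES)

def client_key(peer_ip: str, headers: dict) -> str:
    """Pick the rate-limit key for one request (hypothetical helper)."""
    api_key = headers.get("X-API-Key")   # assumed already authenticated
    if api_key:
        return "key:" + api_key
    # Only look at X-Forwarded-For when the TCP peer is one of our proxies.
    if is_trusted(peer_ip):
        raw = headers.get("X-Forwarded-For", "")
        hops = [h.strip() for h in raw.split(",") if h.strip()]
        # Walk right to left: the rightmost entries were appended by our
        # own proxies, anything further left is client-supplied.
        for hop in reversed(hops):
            if not is_trusted(hop):
                return "ip:" + hop
    return "ip:" + peer_ip

# Untrusted peer: the forwarded header is ignored entirely.
print(client_key("203.0.113.9", {"X-Forwarded-For": "1.2.3.4"}))
# Trusted peer: the forged leftmost entry is never reached.
print(client_key("10.0.0.2", {"X-Forwarded-For": "1.2.3.4, 198.51.100.7"}))
```

Walking the header from the right means a client can prepend whatever it likes without changing which bucket it lands in.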

Worked example

Capacity CAP = 5, refill RATE = 1 token/second. The bucket starts full. A client fires a burst of 8 requests within one second — too quick for any meaningful refill:

start:  tokens = 5
r1  spend -> tokens 4   allowed -> origin
r2  spend -> tokens 3   allowed -> origin
r3  spend -> tokens 2   allowed -> origin
r4  spend -> tokens 1   allowed -> origin
r5  spend -> tokens 0   allowed -> origin
r6  empty -> 429 Too Many Requests (Retry-After: 1)
r7  empty -> 429
r8  empty -> 429
--- 1 second later, lazy refill adds 1 token ---
        tokens = 1
r9  spend -> tokens 0   allowed -> origin

So of the 8-request burst, 5 are allowed and 3 get 429. The origin only ever saw 5 requests. One second later the bucket has refilled by exactly one token, so the next request is allowed again — a steady trickle, as promised.
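
The trace above can be replayed in a few lines. This sketch is self-contained and simplifies the burst to a single instant (all 8 requests at t = 0), with explicit timestamps instead of a real clock; the Bucket class is an illustrative stand-in for the per-key state described earlier.

```python
CAP = 5       # bucket capacity
RATE = 1.0    # tokens added per second

class Bucket:
    def __init__(self, now):
        self.tokens = float(CAP)   # the bucket starts full
        self.ts = now

    def allow(self, now):
        # Lazy refill for the elapsed time, then try to spend one token.
        self.tokens = min(CAP, self.tokens + (now - self.ts) * RATE)
        self.ts = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

b = Bucket(now=0.0)
burst = [b.allow(0.0) for _ in range(8)]   # r1..r8, all at t = 0
r9 = b.allow(1.0)                          # one second later
print(burst.count(True), burst.count(False), r9)   # prints: 5 3 True
```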

Check yourself

1. You run the same 100-requests-per-minute token-bucket limit independently on each of 4 gateway edges, with no shared store. What's the real limit a single client can hit?

2. With CAP = 5 and RATE = 1/sec, a client sends 8 requests in a single second. How many reach the origin?