Stop the flood at the front door so your origin only ever sees a steady trickle.
A reverse proxy (or API gateway) sits in front of your real servers and receives every request first. Before it forwards anything to the origin, it can decide: is this caller asking too fast? If so, it answers right there with 429 Too Many Requests and the origin never even hears about it.
The classic way to make that decision is a token bucket. Each client (per IP or per API key) gets a bucket that holds up to CAP tokens and refills at a steady RATE. Every allowed request spends one token. A short burst can drain the bucket fast — but once it's empty, further requests get rejected until the bucket trickles back up.
The gateway stores each bucket as just a token count and a last-refill timestamp. On each request it first refills the bucket lazily, based on how much time has passed (no background timer needed): tokens increase by elapsed × RATE, capped at CAP. Then, if at least one token is available, it spends one and forwards the request to the origin; otherwise it short-circuits with 429.
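import time
from dataclasses import dataclass

CAP = 5    # bucket capacity (max burst)
RATE = 1   # tokens added per second

@dataclass
class Bucket:
    tokens: float   # tokens currently available
    ts: float       # time of the last refill

bucket = {}  # one Bucket per client key

def allow(key, now):
    # a key seen for the first time starts with a full bucket
    b = bucket.setdefault(key, Bucket(tokens=CAP, ts=now))
    # lazy refill: credit tokens for elapsed time, capped at CAP
    b.tokens = min(CAP, b.tokens + (now - b.ts) * RATE)
    b.ts = now
    if b.tokens >= 1:
        b.tokens -= 1
        return True    # forward to origin
    return False       # 429 Too Many Requests

# at the edge, before proxying
# (Response and forward_to_origin are placeholders for your gateway framework):
def handle(request, client_key):
    if not allow(client_key, time.monotonic()):
        return Response(status=429, headers={"Retry-After": "1"})
    return forward_to_origin(request)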
Enforcing this at the edge matters: a rejected request costs the gateway a tiny token check, while the origin — your database, your business logic — does zero work. The Retry-After header tells well-behaved clients exactly how long to back off.
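The fixed `Retry-After: 1` in the snippet above happens to match RATE = 1. For other rates, the wait can be derived from how far the bucket is below one token. The following is a minimal sketch under that assumption; `retry_after_seconds` is a hypothetical helper name, not part of any gateway's API.

```python
import math

RATE = 1  # tokens added per second, as in the example above

def retry_after_seconds(tokens, rate=RATE):
    """Whole seconds until the bucket holds at least one token (hypothetical helper)."""
    if tokens >= 1:
        return 0  # a token is already available, no wait needed
    # the deficit (1 - tokens) refills at `rate` tokens per second;
    # round up because Retry-After takes a whole number of seconds
    return math.ceil((1 - tokens) / rate)

print(retry_after_seconds(0))             # empty bucket at 1 token/s
print(retry_after_seconds(0.5))           # half a token already accrued
print(retry_after_seconds(0, rate=0.25))  # slower refill: one token every 4 s
```

Rounding up errs on the side of a slightly longer wait, so a client that obeys the header should find a token waiting when it returns.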
| Approach | Burst behavior | Note |
|---|---|---|
| Token bucket | Allows bursts up to CAP, then steady RATE | Smooth and forgiving; one counter + timestamp per key |
| Fixed window | Up to 2× limit across a window boundary | Cheapest (one counter), but boundary bursts slip through |
| Sliding log | Exact — no boundary spike | Most accurate, but stores a timestamp per request (memory heavy) |
| Per-edge local count | Fast, zero network hop | With N edges the real limit is N× the intended one |
| Shared (Redis) count | Accurate across the whole cluster | Adds a round-trip of latency to every request |
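To make the fixed-window row concrete, here is a small sketch of a fixed-window counter and the boundary burst it lets through. The names (`allow_fixed_window`, `counts`) and the values LIMIT = 5 and WINDOW = 60 are assumptions chosen for illustration, and timestamps are passed in by hand so the behavior is reproducible.

```python
LIMIT = 5     # allowed requests per window (illustrative value)
WINDOW = 60   # window length in seconds (illustrative value)

counts = {}   # (key, window index) -> requests seen in that window
              # (a real limiter would also expire old windows)

def allow_fixed_window(key, now):
    """Fixed-window counter: one integer per key per window."""
    window = int(now // WINDOW)          # which window `now` falls in
    seen = counts.get((key, window), 0)
    if seen >= LIMIT:
        return False                     # over the limit for this window
    counts[(key, window)] = seen + 1
    return True

# a full window's worth at the end of one window,
# then another full window's worth at the start of the next
burst = [59.0] * LIMIT + [60.0] * LIMIT
allowed = sum(allow_fixed_window("client-a", t) for t in burst)
print(allowed, "requests allowed within about one second")
```

All ten requests pass even though the limit is five per window, because the counter resets at the boundary. A token bucket with CAP = 5 would have stopped the burst after the fifth request.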
With a fixed window, a client can send a full window of requests at 11:59:59 and another full window at 12:00:00: double the intended rate in two seconds. A token bucket or sliding window avoids this.
Client IPs can be spoofed through X-Forwarded-For. Prefer authenticated API keys, and only trust forwarded IPs from your own proxies.
Don't omit the Retry-After header. Without it, clients retry blindly and hammer you harder. Always tell them when to come back.
A flood of 429s can still strain the edge. Keep the reject path cheap and consider connection-level limits too.
Capacity CAP = 5, refill RATE = 1 token/second. The bucket starts full. A client fires a burst of 8 requests within one second, too quick for any meaningful refill:
start: tokens = 5
r1 spend -> tokens 4 allowed -> origin
r2 spend -> tokens 3 allowed -> origin
r3 spend -> tokens 2 allowed -> origin
r4 spend -> tokens 1 allowed -> origin
r5 spend -> tokens 0 allowed -> origin
r6 empty -> 429 Too Many Requests (Retry-After: 1)
r7 empty -> 429
r8 empty -> 429
--- 1 second later, lazy refill adds 1 token ---
tokens = 1
r9 spend -> tokens 0 allowed -> origin
So of the 8-request burst, 5 are allowed and 3 get 429. The origin only ever saw 5 requests. One second later the bucket has refilled by exactly one token, so the next request is allowed again — a steady trickle, as promised.
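The trace above can be reproduced with a few lines of code. This is a sketch under a simplifying assumption: all eight burst requests arrive at the same instant (t = 0), so no partial refill happens between them. The `TokenBucket` class is a name introduced here for illustration.

```python
CAP = 5    # bucket capacity (max burst)
RATE = 1   # tokens added per second

class TokenBucket:
    def __init__(self, now):
        self.tokens = CAP   # the bucket starts full
        self.ts = now       # time of the last refill

    def allow(self, now):
        # lazy refill: credit tokens for elapsed time, capped at CAP
        self.tokens = min(CAP, self.tokens + (now - self.ts) * RATE)
        self.ts = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True     # would be forwarded to the origin
        return False        # would be answered with 429

b = TokenBucket(now=0.0)
burst = [b.allow(0.0) for _ in range(8)]   # r1..r8, all at the same instant
r9 = b.allow(1.0)                          # one second later

print(burst.count(True), "allowed,", burst.count(False), "rejected")
print("r9 allowed:", r9)
```

If the burst were spread across the second instead, fractional tokens would accumulate between requests, so the exact split could shift slightly; the bucket never grants more than CAP plus whatever RATE has refilled.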
1. You run the same 100-requests-per-minute token-bucket limit independently on each of 4 gateway edges, with no shared store. What's the real limit a single client can hit?
2. With CAP = 5 and RATE = 1/sec, a client sends 8 requests in a single second. How many reach the origin?