Rate limiting with a token bucket

Each request spends a token; the bucket refills at a steady rate, so short bursts are fine but the long-run rate stays capped.

The idea

Picture a small bucket that a faucet drips tokens into at a steady pace. The bucket can only hold so many — say 5. Every time a request arrives, it tries to take one token. If a token is there, the request is allowed and the token is spent. If the bucket is empty, the request is rejected (an HTTP 429 Too Many Requests).

This gives you two guarantees at once. The capacity caps how big a sudden burst can be, and the refill rate caps the sustained throughput over time. Bursts are welcome up to the bucket size; anything faster than the faucet eventually gets throttled.

Press play, or step through one event at a time.

How it works

You don't run a background timer that adds a token every tick. Instead you lazily refill: whenever a request arrives, look at how much real time has passed since you last touched the bucket, add that many tokens (capped at capacity), then try to spend one.

class TokenBucket:
    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity          # max burst (bucket size)
        self.rate     = refill_per_sec    # tokens added per second
        self.tokens   = capacity          # start full
        self.last     = now()             # monotonic clock, e.g. time.monotonic()

    def allow(self):
        t = now()
        elapsed = max(0.0, t - self.last) # clamp: clocks can move oddly
        self.last = t

        # lazy refill: top up by however much time has passed, capped at capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)

        if self.tokens >= 1:
            self.tokens -= 1
            return True          # allowed
        return False             # rejected (HTTP 429)

Tokens are a fractional float, so a refill rate of 2/sec adds 0.5 tokens after 250 ms — partial credit accrues smoothly between requests.

Trade-offs

Algorithm	Bursts	Memory per key	Boundary spikes
Token bucket	Allows up to capacity	O(1) — tokens + timestamp	Smoothed by design
Leaky bucket	Smooths to a constant outflow	O(1) — queue depth + timestamp	None; rigid drain rate
Fixed window	Cheap but coarse	O(1) — one counter	Up to 2× at window edges
Sliding window log	Exact, smooth	O(n) — a timestamp per request	None; precise
Sliding window counter	Smooth approximation	O(1) — two counters	Mostly avoided

Watch out for

Wall-clock for refill. Use a monotonic clock, not wall-clock time. If NTP steps the system clock backward, elapsed can go negative and you either grant a flood of tokens or stall the limiter. Clamp elapsed to be non-negative regardless.
Distributed state without atomicity. Behind several app servers, an in-memory bucket per process multiplies your real limit by the server count. Move the bucket to a shared store such as Redis and make read-refill-spend a single atomic step (a Lua script or an INCR-style primitive) so two concurrent requests can't both spend the last token.
Fixed-window boundary doubling. If you reach for a fixed window instead, a client can fire a full window's worth of requests just before the reset and another full window's worth just after — up to 2× the intended rate across the seam. Token bucket and sliding windows avoid this.
Per-key cardinality blowup. One bucket per user, per IP, or per API key sounds clean until you have millions of keys. Bound memory with TTL eviction on idle buckets, and be deliberate about your key — per-IP behind a shared NAT or proxy can punish many users as one.
No Retry-After header. On a 429, tell the client when to come back. Return a Retry-After (seconds until one token refills) so well-behaved clients back off instead of hammering and making things worse.

Worked example

Take a bucket with capacity 5 and a refill rate of 1 token per second, starting full. A client fires a burst of 7 requests almost instantly. The first 5 each find a token and are allowed, draining the bucket to empty. Requests 6 and 7 arrive before any meaningful time has passed, find tokens < 1, and are rejected with a 429.

Now the client goes quiet. Over the next 2 seconds the lazy refill credits 2 × 1 = 2 tokens back into the bucket (still capped at 5). The next two requests find tokens waiting and are allowed; a third is rejected again. The burst was absorbed up to the bucket size, and the sustained rate settled at the faucet's 1/sec — exactly the two guarantees you wanted. Step through the visual above to watch the tokens drain and refill.

Check yourself

A bucket has capacity 5, refill 1 token/sec, and is currently empty. The client waits 2 seconds, then sends 4 requests back to back. How many are allowed?

Which single change best protects a token-bucket limiter running across many app servers?