Rate limiting with a token bucket

Each request spends a token; the bucket refills at a steady rate, so short bursts are fine but the long-run rate stays capped.

The idea

Picture a small bucket that a faucet drips tokens into at a steady pace. The bucket can only hold so many — say 5. Every time a request arrives, it tries to take one token. If a token is there, the request is allowed and the token is spent. If the bucket is empty, the request is rejected (an HTTP 429 Too Many Requests).

This gives you two guarantees at once. The capacity caps how big a sudden burst can be, and the refill rate caps the sustained throughput over time. Bursts are welcome up to the bucket size; anything faster than the faucet eventually gets throttled.

Press play, or step through one event at a time.

How it works

You don't run a background timer that adds a token every tick. Instead you lazily refill: whenever a request arrives, look at how much real time has passed since you last touched the bucket, add that many tokens (capped at capacity), then try to spend one.

class TokenBucket:
    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity          # max burst (bucket size)
        self.rate     = refill_per_sec    # tokens added per second
        self.tokens   = capacity          # start full
        self.last     = now()             # monotonic clock, e.g. time.monotonic()

    def allow(self):
        t = now()
        elapsed = max(0.0, t - self.last) # clamp: clocks can move oddly
        self.last = t

        # lazy refill: top up by however much time has passed, capped at capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)

        if self.tokens >= 1:
            self.tokens -= 1
            return True          # allowed
        return False             # rejected (HTTP 429)

Tokens are a fractional float, so a refill rate of 2/sec adds 0.5 tokens after 250 ms — partial credit accrues smoothly between requests.

Trade-offs

Algorithm Bursts Memory per key Boundary spikes
Token bucket Allows up to capacity O(1) — tokens + timestamp Smoothed by design
Leaky bucket Smooths to a constant outflow O(1) — queue depth + timestamp None; rigid drain rate
Fixed window Cheap but coarse O(1) — one counter Up to 2× at window edges
Sliding window log Exact, smooth O(n) — a timestamp per request None; precise
Sliding window counter Smooth approximation O(1) — two counters Mostly avoided

Watch out for

Worked example

Take a bucket with capacity 5 and a refill rate of 1 token per second, starting full. A client fires a burst of 7 requests almost instantly. The first 5 each find a token and are allowed, draining the bucket to empty. Requests 6 and 7 arrive before any meaningful time has passed, find tokens < 1, and are rejected with a 429.

Now the client goes quiet. Over the next 2 seconds the lazy refill credits 2 × 1 = 2 tokens back into the bucket (still capped at 5). The next two requests find tokens waiting and are allowed; a third is rejected again. The burst was absorbed up to the bucket size, and the sustained rate settled at the faucet's 1/sec — exactly the two guarantees you wanted. Step through the visual above to watch the tokens drain and refill.

Check yourself

A bucket has capacity 5, refill 1 token/sec, and is currently empty. The client waits 2 seconds, then sends 4 requests back to back. How many are allowed?

Which single change best protects a token-bucket limiter running across many app servers?