Rate limiting

A bucket of tokens that quietly refills — spend one per request, and when it runs dry, callers have to wait.

The idea

Picture a small bucket that holds up to capacity tokens. Tokens drip back in at a steady refill_rate per second, but the bucket never overflows past full. Every request that arrives scoops out one token.

If a token is there, the request is allowed. If the bucket is empty, the request is rejected with HTTP 429 (“too many requests”). A full bucket lets a quiet client fire a quick burst, but over time nobody can outrun the drip — so the long-run average is bounded by refill_rate.

See it work

Press play, or step through one event at a time.

How it works

The clever trick is that you don’t need a background timer adding tokens. You refill lazily: on each request, figure out how much time has passed since you last looked, add that many tokens (capped at capacity), then try to spend one.

import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity        # max tokens (burst size)
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity          # start full
        self.last = time.monotonic()    # monotonic, not wall-clock

    def allow(self):
        now = time.monotonic()
        elapsed = now - self.last
        self.last = now
        # lazily add the tokens that accrued; never exceed capacity
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True          # allowed
        return False             # denied -> respond 429

Fractional tokens are fine — they just accumulate. With refill_rate = 1/s, after 0.5 s you’ve earned half a token; it becomes spendable once it crosses 1.

Cost / trade-offs

DimensionToken bucket
Time per requestO(1) — a subtraction and a clamp
State per keyO(1) — two numbers: tokens and last
Burst behaviourAllows bursts up to capacity; smoother than a fixed window
StrictnessLess strict than leaky bucket (which paces output evenly)
Memory at scaleOne small record per client key — millions of keys add up
DistributedNeeds shared state (e.g. Redis) so nodes don’t each grant the full limit

Versus the alternatives. Leaky bucket drains at a fixed rate, smoothing output into a steady stream but disallowing bursts. Fixed window counts requests per calendar interval — simple, but lets a client send 2× the limit across a window boundary. Sliding window fixes that edge burst at the cost of more bookkeeping. Token bucket sits in between: cheap, and burst-friendly while still bounding the average.

Watch out for

Worked example

Take capacity = 5, refill_rate = 1/s, bucket starting full. A client fires 5 requests instantly: all 5 are allowed because the bucket had 5 tokens — that’s the burst. The bucket is now empty.

A 6th request arrives immediately: 0 tokens, so it’s rejected with 429, and a sensible Retry-After ≈ 1 s (the time to earn one token). If the client instead waits 3 seconds, the bucket has refilled to 3 tokens (capped at 5), so the next 3 requests pass. Over a long stretch, no matter how the requests clump, the client can’t exceed 1 request per second on average — exactly the refill rate.

Check yourself

1. Bucket is full at capacity = 5 with refill_rate = 1/s. A client sends 5 requests at once, then a 6th 200 ms later. What happens to the 6th?

2. You run the same limiter on 4 web servers, each with its own in-memory bucket, no shared state. What’s the real effective limit a single client can hit?