Rate limiting

A bucket of tokens that quietly refills — spend one per request, and when it runs dry, callers have to wait.

The idea

Picture a small bucket that holds up to capacity tokens. Tokens drip back in at a steady refill_rate per second, but the bucket never overflows past full. Every request that arrives scoops out one token.

If a token is there, the request is allowed. If the bucket is empty, the request is rejected with HTTP 429 (“too many requests”). A full bucket lets a quiet client fire a quick burst, but over time nobody can outrun the drip — so the long-run average is bounded by refill_rate.

See it work

Capacity 5 Refill rate 1/s

Press play, or step through one event at a time.

How it works

The clever trick is that you don’t need a background timer adding tokens. You refill lazily: on each request, figure out how much time has passed since you last looked, add that many tokens (capped at capacity), then try to spend one.

import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity        # max tokens (burst size)
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity          # start full
        self.last = time.monotonic()    # monotonic, not wall-clock

    def allow(self):
        now = time.monotonic()
        elapsed = now - self.last
        self.last = now
        # lazily add the tokens that accrued; never exceed capacity
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True          # allowed
        return False             # denied -> respond 429

Fractional tokens are fine — they just accumulate. With refill_rate = 1/s, after 0.5 s you’ve earned half a token; it becomes spendable once it crosses 1.

Cost / trade-offs

Dimension	Token bucket
Time per request	`O(1)` — a subtraction and a clamp
State per key	`O(1)` — two numbers: `tokens` and `last`
Burst behaviour	Allows bursts up to `capacity`; smoother than a fixed window
Strictness	Less strict than leaky bucket (which paces output evenly)
Memory at scale	One small record per client key — millions of keys add up
Distributed	Needs shared state (e.g. Redis) so nodes don’t each grant the full limit

Versus the alternatives. Leaky bucket drains at a fixed rate, smoothing output into a steady stream but disallowing bursts. Fixed window counts requests per calendar interval — simple, but lets a client send 2× the limit across a window boundary. Sliding window fixes that edge burst at the cost of more bookkeeping. Token bucket sits in between: cheap, and burst-friendly while still bounding the average.

Watch out for

Wall-clock vs monotonic. Refill math uses elapsed time. If you read a wall clock and it jumps backward (NTP correction, daylight saving), elapsed can go negative and corrupt the count. Use a monotonic clock.
Per-key memory blowup. One bucket per API key or IP is fine until you have millions of them. Evict idle keys (a full bucket that hasn’t been touched is indistinguishable from a fresh one) or use a TTL.
Eager timer refills. Adding tokens on a background timer for every key wastes CPU. Refill lazily on access — compute the delta only when a request actually arrives.
Thundering herd on retry-after. If every rejected client is told to retry at the same instant, they all come back together and overwhelm you again. Add jitter to Retry-After.
Distributed multiplication. Run the limiter independently on N nodes and each grants the full limit, so the real ceiling is N×. Share state (Redis, a central store) or partition keys to specific nodes.

Worked example

Take capacity = 5, refill_rate = 1/s, bucket starting full. A client fires 5 requests instantly: all 5 are allowed because the bucket had 5 tokens — that’s the burst. The bucket is now empty.

A 6th request arrives immediately: 0 tokens, so it’s rejected with 429, and a sensible Retry-After ≈ 1 s (the time to earn one token). If the client instead waits 3 seconds, the bucket has refilled to 3 tokens (capped at 5), so the next 3 requests pass. Over a long stretch, no matter how the requests clump, the client can’t exceed 1 request per second on average — exactly the refill rate.

Check yourself

1. Bucket is full at capacity = 5 with refill_rate = 1/s. A client sends 5 requests at once, then a 6th 200 ms later. What happens to the 6th?

2. You run the same limiter on 4 web servers, each with its own in-memory bucket, no shared state. What’s the real effective limit a single client can hit?