A bucket of tokens that quietly refills — spend one per request, and when it runs dry, callers have to wait.
Picture a small bucket that holds up to capacity tokens. Tokens drip back in at a steady refill_rate per second, but the bucket never overflows past full. Every request that arrives scoops out one token.
If a token is there, the request is allowed. If the bucket is empty, the request is rejected with HTTP 429 (“too many requests”). A full bucket lets a quiet client fire a quick burst, but over time nobody can outrun the drip — so the long-run average is bounded by refill_rate.
The clever trick is that you don’t need a background timer adding tokens. You refill lazily: on each request, figure out how much time has passed since you last looked, add that many tokens (capped at capacity), then try to spend one.
import time
class TokenBucket:
def __init__(self, capacity, refill_rate):
self.capacity = capacity # max tokens (burst size)
self.refill_rate = refill_rate # tokens added per second
self.tokens = capacity # start full
self.last = time.monotonic() # monotonic, not wall-clock
def allow(self):
now = time.monotonic()
elapsed = now - self.last
self.last = now
# lazily add the tokens that accrued; never exceed capacity
self.tokens = min(self.capacity,
self.tokens + elapsed * self.refill_rate)
if self.tokens >= 1:
self.tokens -= 1
return True # allowed
return False # denied -> respond 429
Fractional tokens are fine — they just accumulate. With refill_rate = 1/s, after 0.5 s you’ve earned half a token; it becomes spendable once it crosses 1.
| Dimension | Token bucket |
|---|---|
| Time per request | O(1) — a subtraction and a clamp |
| State per key | O(1) — two numbers: tokens and last |
| Burst behaviour | Allows bursts up to capacity; smoother than a fixed window |
| Strictness | Less strict than leaky bucket (which paces output evenly) |
| Memory at scale | One small record per client key — millions of keys add up |
| Distributed | Needs shared state (e.g. Redis) so nodes don’t each grant the full limit |
Versus the alternatives. Leaky bucket drains at a fixed rate, smoothing output into a steady stream but disallowing bursts. Fixed window counts requests per calendar interval — simple, but lets a client send 2× the limit across a window boundary. Sliding window fixes that edge burst at the cost of more bookkeeping. Token bucket sits in between: cheap, and burst-friendly while still bounding the average.
elapsed can go negative and corrupt the count. Use a monotonic clock.Retry-After.Take capacity = 5, refill_rate = 1/s, bucket starting full. A client fires 5 requests instantly: all 5 are allowed because the bucket had 5 tokens — that’s the burst. The bucket is now empty.
A 6th request arrives immediately: 0 tokens, so it’s rejected with 429, and a sensible Retry-After ≈ 1 s (the time to earn one token). If the client instead waits 3 seconds, the bucket has refilled to 3 tokens (capped at 5), so the next 3 requests pass. Over a long stretch, no matter how the requests clump, the client can’t exceed 1 request per second on average — exactly the refill rate.
1. Bucket is full at capacity = 5 with refill_rate = 1/s. A client sends 5 requests at once, then a 6th 200 ms later. What happens to the 6th?
2. You run the same limiter on 4 web servers, each with its own in-memory bucket, no shared state. What’s the real effective limit a single client can hit?