Each request spends a token; the bucket refills at a steady rate, so short bursts are fine but the long-run rate stays capped.
Picture a small bucket that a faucet drips tokens into at a steady pace. The bucket can only hold so many — say 5. Every time a request arrives, it tries to take one token. If a token is there, the request is allowed and the token is spent. If the bucket is empty, the request is rejected (an HTTP 429 Too Many Requests).
This gives you two guarantees at once. The capacity caps how big a sudden burst can be, and the refill rate caps the sustained throughput over time. Bursts are welcome up to the bucket size; anything faster than the faucet eventually gets throttled.
You don't run a background timer that adds a token every tick. Instead you lazily refill: whenever a request arrives, look at how much real time has passed since you last touched the bucket, add that many tokens (capped at capacity), then try to spend one.
class TokenBucket:
def __init__(self, capacity, refill_per_sec):
self.capacity = capacity # max burst (bucket size)
self.rate = refill_per_sec # tokens added per second
self.tokens = capacity # start full
self.last = now() # monotonic clock, e.g. time.monotonic()
def allow(self):
t = now()
elapsed = max(0.0, t - self.last) # clamp: clocks can move oddly
self.last = t
# lazy refill: top up by however much time has passed, capped at capacity
self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
if self.tokens >= 1:
self.tokens -= 1
return True # allowed
return False # rejected (HTTP 429)
Tokens are a fractional float, so a refill rate of 2/sec adds 0.5 tokens after 250 ms — partial credit accrues smoothly between requests.
| Algorithm | Bursts | Memory per key | Boundary spikes |
|---|---|---|---|
| Token bucket | Allows up to capacity | O(1) — tokens + timestamp | Smoothed by design |
| Leaky bucket | Smooths to a constant outflow | O(1) — queue depth + timestamp | None; rigid drain rate |
| Fixed window | Cheap but coarse | O(1) — one counter | Up to 2× at window edges |
| Sliding window log | Exact, smooth | O(n) — a timestamp per request | None; precise |
| Sliding window counter | Smooth approximation | O(1) — two counters | Mostly avoided |
elapsed can go negative and you either grant a flood of tokens or stall the limiter. Clamp elapsed to be non-negative regardless.INCR-style primitive) so two concurrent requests can't both spend the last token.Retry-After header. On a 429, tell the client when to come back. Return a Retry-After (seconds until one token refills) so well-behaved clients back off instead of hammering and making things worse.Take a bucket with capacity 5 and a refill rate of 1 token per second, starting full. A client fires a burst of 7 requests almost instantly. The first 5 each find a token and are allowed, draining the bucket to empty. Requests 6 and 7 arrive before any meaningful time has passed, find tokens < 1, and are rejected with a 429.
Now the client goes quiet. Over the next 2 seconds the lazy refill credits 2 × 1 = 2 tokens back into the bucket (still capped at 5). The next two requests find tokens waiting and are allowed; a third is rejected again. The burst was absorbed up to the bucket size, and the sustained rate settled at the faucet's 1/sec — exactly the two guarantees you wanted. Step through the visual above to watch the tokens drain and refill.
A bucket has capacity 5, refill 1 token/sec, and is currently empty. The client waits 2 seconds, then sends 4 requests back to back. How many are allowed?
Which single change best protects a token-bucket limiter running across many app servers?