Push events to a server you don't control, and keep retrying calmly until it answers or you give up.
Something happens in your system — a payment clears, a file finishes processing. You promised a customer their server would hear about it, so the gateway delivers the event with an HTTP POST to their endpoint.
But their server might be mid-deploy, overloaded, or simply slow. So deliveries don't go out directly: they ride a queue, and a worker retries failures with exponential backoff — waiting longer after each miss. After a handful of failed attempts the event is parked in a dead-letter queue for later inspection and manual replay, instead of retrying forever.
The delivery worker pulls an event and loops over attempts. Each attempt signs the body, POSTs it, and classifies the response. 2xx means delivered — ack and stop. A 4xx other than 408 or 429 is a permanent client error — don't retry, park it. Everything else (408, 429, 5xx, connection failures, timeouts) is retryable.
Before each retry the worker waits base × 2^attempt with full jitter — a random fraction of that delay — so a fleet of stuck deliveries doesn't all retry in lockstep when the endpoint recovers. After max_attempts, the event goes to the dead-letter queue.
import time, random, hmac, hashlib

# serialize, http_post and park_in_dlq are helpers defined elsewhere.
# http_post is assumed to return the HTTP status, or 0 on timeout/connection error.
def deliver(event, endpoint, secret, base=1.0, max_attempts=5):
    body = serialize(event)                      # bytes; secret must be bytes too
    # HMAC-sign the body so the receiver can verify authenticity
    sig = hmac.new(secret, body, hashlib.sha256).hexdigest()
    headers = {"X-Signature": "sha256=" + sig,
               "X-Event-Id": event.id}           # receiver dedups on this
    for attempt in range(max_attempts):          # attempt = 0, 1, 2, ...
        status = http_post(endpoint, body, headers, timeout=10)
        if 200 <= status < 300:
            return "delivered"                   # 2xx -> ack, done
        if 400 <= status < 500 and status not in (408, 429):
            return park_in_dlq(event, status)    # permanent 4xx, no retry
        # retryable: 408, 429, 5xx, timeout, connection error
        if attempt == max_attempts - 1:
            break                                # no wait after the final attempt
        delay = base * (2 ** attempt)            # 1, 2, 4, 8, ...
        time.sleep(random.random() * delay)      # full jitter
    return park_in_dlq(event, status)            # exhausted -> dead-letter
Because the worker may POST more than once (a slow endpoint can process a request and still time out before its 2xx reaches us), delivery is at-least-once. The receiver dedups on X-Event-Id to stay idempotent.
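The receiving side of that contract can be sketched as follows. This is a minimal illustration, not the gateway's actual receiver: `SECRET`, `seen_event_ids`, `applied`, and `handle_webhook` are hypothetical names, and an in-memory set stands in for what would normally be a durable store with a unique key on the event id.

```python
import hashlib
import hmac

SECRET = b"shared-secret"      # hypothetical secret shared with the gateway
seen_event_ids = set()         # stand-in for a durable store keyed on event id
applied = []                   # stand-in for the receiver's real side effects


def handle_webhook(body: bytes, headers: dict) -> int:
    """Return the HTTP status the receiver would send back."""
    expected = "sha256=" + hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    received = headers.get("X-Signature", "")
    # Constant-time comparison, so timing does not reveal how much matched.
    if not hmac.compare_digest(expected.encode(), received.encode()):
        return 401
    event_id = headers.get("X-Event-Id")
    if event_id is None:
        return 400
    if event_id in seen_event_ids:
        return 200             # duplicate: acknowledge without reapplying
    applied.append(body)       # apply the event exactly once
    seen_event_ids.add(event_id)
    return 200
```

Returning 200 for a duplicate matters: any other answer would make the gateway keep retrying an event that has already been applied. In a real receiver the check and the insert would need to be atomic, for example a single insert guarded by a unique constraint.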
| Property | Effect |
|---|---|
| Attempts per event | Up to N (e.g. 5) before the dead-letter queue |
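| Total wait before parking | At most base × (2^(N−1) − 1), because N attempts have N − 1 waits between them: base 1s, N=5 → up to 15s (1 + 2 + 4 + 8), about half that on average with full jitter; time spent on the requests themselves is extra |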
| Storage | The dead-letter queue holds every exhausted or permanently-failed event for replay |
| Trade-off | Durability and eventual delivery, paid for in delivery latency and duplicate-delivery risk |
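The wait arithmetic can be checked with a short sketch. It assumes the same loop shape as the delivery worker, where a wait happens only between attempts, so N attempts have N − 1 waits. `backoff_caps` and `jittered_waits` are hypothetical helper names introduced for illustration.

```python
import random


def backoff_caps(base: float, max_attempts: int) -> list:
    """Upper bound on each wait; N attempts have N - 1 waits between them."""
    return [base * 2 ** attempt for attempt in range(max_attempts - 1)]


def jittered_waits(caps, rng):
    """Full jitter: each actual wait is uniform in [0, cap)."""
    return [rng.random() * cap for cap in caps]


caps = backoff_caps(base=1.0, max_attempts=5)
print(caps)        # [1.0, 2.0, 4.0, 8.0]
print(sum(caps))   # 15.0, the worst-case total wait

waits = jittered_waits(caps, random.Random())
print(sum(waits) <= sum(caps))   # True: jitter only ever shortens a wait
```

With full jitter the expected total is half the worst case, since each wait averages half its cap.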
A 400 or 422 won't fix itself; retrying wastes work and can hammer a misconfigured endpoint. Park permanent failures immediately. (But 408 and 429 are retryable.)
If every stuck delivery waits exactly base × 2^n, they all fire together the instant the endpoint recovers: a thundering herd that knocks it back down. Add full jitter.
A payment system emits charge.succeeded as event evt_8f31 and the gateway POSTs it to the merchant. The merchant's server is mid-deploy and returns 503 for about six seconds. Attempt 1 fails; the worker waits ~1s (base 1s, jittered) and retries → 503 again. It waits ~2s → this time a 429 while the new instance warms up. It waits ~4s → the merchant is back and returns 200 OK on the fourth attempt. Delivered and acked.
Now suppose one of the earlier attempts had actually reached the merchant and recorded the charge, but its response timed out before reaching the gateway. The retry is then a duplicate delivery. The merchant keys on X-Event-Id: evt_8f31, sees it has already applied that id, and treats the repeat as a no-op. The customer is charged once.
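The scenario can be replayed without a network by scripting the endpoint's responses. This is a sketch under stated assumptions: `classify` and `run_delivery` are hypothetical names, the classification rule is the one described earlier (2xx delivered, 4xx other than 408 and 429 permanent, everything else retryable), and waits are recorded as caps rather than slept.

```python
def classify(status: int) -> str:
    """Map an HTTP status to the worker's three outcomes."""
    if 200 <= status < 300:
        return "delivered"
    if 400 <= status < 500 and status not in (408, 429):
        return "permanent"
    return "retry"


def run_delivery(responses, base=1.0, max_attempts=5):
    """Replay scripted statuses; return (outcome, attempts made, wait caps)."""
    responses = iter(responses)
    wait_caps = []
    for attempt in range(max_attempts):
        outcome = classify(next(responses))
        if outcome == "delivered":
            return "delivered", attempt + 1, wait_caps
        if outcome == "permanent":
            return "dead-lettered", attempt + 1, wait_caps
        if attempt < max_attempts - 1:
            # Record the cap only; the real worker sleeps a random fraction of it.
            wait_caps.append(base * 2 ** attempt)
    return "dead-lettered", max_attempts, wait_caps


print(run_delivery([503, 503, 429, 200]))
# ('delivered', 4, [1.0, 2.0, 4.0])
```

Three retryable responses cost three waits capped at 1s, 2s, and 4s, and the fourth attempt succeeds.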
The endpoint returns 400 Bad Request. Should the gateway keep retrying?