Webhook delivery gateway

Push events to a server you don't control, and keep retrying calmly until it answers or you give up.

The idea

Something happens in your system — a payment clears, a file finishes processing. You promised a customer their server would hear about it, so the gateway delivers the event with an HTTP POST to their endpoint.

But their server might be mid-deploy, overloaded, or simply slow. So deliveries don't go out directly: they ride a queue, and a worker retries failures with exponential backoff — waiting longer after each miss. After a handful of failed attempts the event is parked in a dead-letter queue for later inspection and manual replay, instead of retrying forever.
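The queue, worker, and dead-letter flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: `run_worker` and `replay` are hypothetical names, the queue is an in-memory `queue.Queue` rather than a durable broker, and `deliver` is any callable that returns "delivered" on success.

```python
import queue

def run_worker(events, deliver, dlq):
    """Drain the queue; park anything deliver() could not deliver."""
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return                      # nothing left to deliver
        if deliver(event) != "delivered":
            dlq.append(event)           # parked for inspection, not retried forever

def replay(dlq, events):
    """Manual replay: move parked events back onto the delivery queue."""
    while dlq:
        events.put(dlq.pop(0))
```

Keeping replay as a separate, explicit step matches the idea that dead-lettered events wait for a human decision instead of re-entering the retry loop on their own.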

How it works

The delivery worker pulls an event and loops over attempts. Each attempt signs the body, POSTs it, and classifies the response. 2xx means delivered — ack and stop. A 4xx other than 408 or 429 is a permanent client error — don't retry, park it. Everything else (408, 429, 5xx, connection failures, timeouts) is retryable.
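The classification rule can be isolated as a small function. This is a sketch, not the gateway's actual code: `classify` is a hypothetical helper, and passing `None` for a timeout or connection failure is an assumption made here for illustration.

```python
def classify(status):
    """Map an HTTP status to an action.

    status is an int, or None when the request timed out or the
    connection failed (an assumption of this sketch).
    """
    if status is not None and 200 <= status < 300:
        return "delivered"              # 2xx: ack and stop
    if status is not None and 400 <= status < 500 and status not in (408, 429):
        return "permanent"              # client error: park, don't retry
    return "retry"                      # 408, 429, 5xx, timeouts, everything else
```

For example, `classify(404)` returns "permanent" while `classify(429)` and `classify(None)` both return "retry".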

Before each retry the worker waits base × 2^attempt with full jitter — a random fraction of that delay — so a fleet of stuck deliveries doesn't all retry in lockstep when the endpoint recovers. After max_attempts, the event goes to the dead-letter queue.
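To make the delay schedule concrete, here is a minimal sketch that yields the jittered wait before each retry. `backoff_delays` and its `rng` parameter are hypothetical names introduced for illustration; `rng` exists only so the random fraction can be pinned in a test.

```python
import random

def backoff_delays(base=1.0, max_attempts=5, rng=random.random):
    """Yield the wait before each retry.

    There is one wait between consecutive attempts, so max_attempts
    attempts produce max_attempts - 1 waits.
    """
    for attempt in range(max_attempts - 1):
        ceiling = base * (2 ** attempt)   # 1, 2, 4, 8, ... times base
        yield rng() * ceiling             # full jitter: anywhere in [0, ceiling)
```

With the random fraction pinned at its upper bound, `list(backoff_delays(rng=lambda: 1.0))` gives the ceilings themselves: `[1.0, 2.0, 4.0, 8.0]`.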

import time, random, hmac, hashlib

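def deliver(event, endpoint, secret, base=1.0, max_attempts=5):
    # serialize, http_post and park_in_dlq are helpers defined elsewhere
    body = serialize(event)                      # must return bytes
    # HMAC-sign the body so the receiver can verify authenticity
    sig = hmac.new(secret, body, hashlib.sha256).hexdigest()  # secret is bytes too
    headers = {"X-Signature": "sha256=" + sig,
               "X-Event-Id": event.id}          # receiver dedups on this

    for attempt in range(max_attempts):          # attempt = 0, 1, 2, ...
        # assumed here: http_post returns the status code, or 0 on a timeout
        # or connection error, so those fall through to the retry path
        status = http_post(endpoint, body, headers, timeout=10)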

        if 200 <= status < 300:
            return "delivered"                   # 2xx -> ack, done
        if 400 <= status < 500 and status not in (408, 429):
            return park_in_dlq(event, status)    # permanent 4xx, no retry

        # retryable: 408, 429, 5xx, timeout, connection error
        if attempt == max_attempts - 1:
            break
        delay = base * (2 ** attempt)            # 1, 2, 4, 8, ...
        time.sleep(random.random() * delay)      # full jitter

    return park_in_dlq(event, status)            # exhausted -> dead-letter

Because the worker may POST more than once (a slow endpoint can process a request and still time out before its 2xx reaches us), delivery is at-least-once. The receiver dedups on X-Event-Id to stay idempotent.
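The receiving side is outside the gateway, but a sketch shows what verifying the signature and deduplicating on X-Event-Id could look like. Everything here is an assumption for illustration: the `Receiver` class, its in-memory `seen` set (a real receiver would persist it), and the choice of 401 and 400 for rejected requests.

```python
import hashlib
import hmac

class Receiver:
    def __init__(self, secret: bytes):
        self.secret = secret
        self.seen = set()       # event ids already applied (in memory only)
        self.applied = []       # stand-in for the receiver's real side effects

    def handle(self, body: bytes, headers: dict) -> int:
        expected = "sha256=" + hmac.new(self.secret, body, hashlib.sha256).hexdigest()
        # constant-time comparison avoids leaking how much of the signature matched
        if not hmac.compare_digest(expected, headers.get("X-Signature", "")):
            return 401
        event_id = headers.get("X-Event-Id")
        if event_id is None:
            return 400
        if event_id in self.seen:
            return 200          # duplicate: acknowledge, apply nothing
        self.seen.add(event_id)
        self.applied.append(event_id)
        return 200
```

Returning 200 for a duplicate matters: it tells the gateway to stop retrying, while the side effect still happens only once.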

Cost

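Property | Effect
--- | ---
Attempts per event | Up to N (e.g. 5) before the dead-letter queue
Total wait before parking | At most base × (2^(N−1) − 1), because there is no wait after the final attempt: base 1s, N=5 → up to 15s spread across the four retries (about half that on average with full jitter), excluding time spent in the requests themselves
Storage | The dead-letter queue holds every exhausted or permanently failed event for replay
Trade-off | Durability and eventual delivery, paid for in delivery latency and duplicate-delivery risk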

Watch out for

Worked example

A payment system emits charge.succeeded with event id evt_8f31 and the gateway POSTs it to the merchant. The merchant's server is mid-deploy and not ready for about six seconds, answering 503 at first. Attempt 1 fails; the worker waits up to 1s (base 1s, full jitter) and retries → 503 again. It waits up to 2s → this time a 429 while the new instance warms up. It waits up to 4s → the merchant is back and returns 200 OK on the fourth attempt. Delivered and acked.
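The same sequence can be replayed in code by feeding the retry loop a scripted list of responses. This is a sketch under assumptions: `deliver_scripted` is a hypothetical stand-in for the real worker, no HTTP is involved, the last scripted status repeats if the list runs out, and waits are recorded instead of slept.

```python
import random

def deliver_scripted(responses, base=1.0, max_attempts=5):
    """Run the retry loop against scripted status codes.

    Returns (outcome, statuses seen, waits taken between attempts).
    """
    seen, waits = [], []
    for attempt in range(max_attempts):
        # repeat the last scripted status if the script is shorter than the run
        status = responses[min(attempt, len(responses) - 1)]
        seen.append(status)
        if 200 <= status < 300:
            return "delivered", seen, waits
        if 400 <= status < 500 and status not in (408, 429):
            return "dead-lettered", seen, waits      # permanent 4xx
        if attempt == max_attempts - 1:
            break                                    # no wait after the last try
        waits.append(random.random() * base * (2 ** attempt))
    return "dead-lettered", seen, waits              # attempts exhausted

outcome, seen, waits = deliver_scripted([503, 503, 429, 200])
```

Here `outcome` is "delivered" after four attempts and three waits, bounded by 1s, 2s, and 4s. A script of `[503]` alone exhausts all five attempts and ends dead-lettered.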

Now suppose one of the earlier attempts had actually reached the merchant and recorded the charge, but its response timed out before it got back to the gateway. The retry that follows is then a duplicate delivery. Because the merchant keys on X-Event-Id: evt_8f31, it sees that it has already applied that id and treats the repeat as a no-op. The charge is recorded once.

Check yourself

The endpoint returns 400 Bad Request. Should the gateway keep retrying?