Webhook delivery gateway

Push events to a server you don't control, and keep retrying calmly until it answers or you give up.

The idea

Something happens in your system — a payment clears, a file finishes processing. You promised a customer their server would hear about it, so the gateway delivers the event with an HTTP POST to their endpoint.

But their server might be mid-deploy, overloaded, or simply slow. So deliveries don't go out directly: they ride a queue, and a worker retries failures with exponential backoff — waiting longer after each miss. After a handful of failed attempts the event is parked in a dead-letter queue for later inspection and manual replay, instead of retrying forever.

How it works

The delivery worker pulls an event and loops over attempts. Each attempt signs the body, POSTs it, and classifies the response. 2xx means delivered — ack and stop. A 4xx other than 408 or 429 is a permanent client error — don't retry, park it. Everything else (408, 429, 5xx, connection failures, timeouts) is retryable.

Before each retry the worker waits base × 2^attempt with full jitter — a random fraction of that delay — so a fleet of stuck deliveries doesn't all retry in lockstep when the endpoint recovers. After max_attempts, the event goes to the dead-letter queue.

import hashlib
import hmac
import random
import time

# serialize, http_post, and park_in_dlq are helpers defined elsewhere.
# Assumption: http_post returns the status code and raises OSError on a
# timeout or connection failure (TimeoutError and ConnectionError are subclasses).

def deliver(event, endpoint, secret, base=1.0, max_attempts=5):
    body = serialize(event)                      # bytes; secret must be bytes too
    # HMAC-sign the body so the receiver can verify authenticity
    sig = hmac.new(secret, body, hashlib.sha256).hexdigest()
    headers = {"X-Signature": "sha256=" + sig,
               "X-Event-Id": event.id}          # receiver dedups on this

    status = None                                # stays None if no response arrived
    for attempt in range(max_attempts):          # attempt = 0, 1, 2, ...
        try:
            status = http_post(endpoint, body, headers, timeout=10)
        except OSError:                          # timeout or connection error
            status = None

        if status is not None:
            if 200 <= status < 300:
                return "delivered"               # 2xx -> ack, done
            if 400 <= status < 500 and status not in (408, 429):
                return park_in_dlq(event, status)  # permanent 4xx, no retry

        # retryable: 408, 429, 5xx, timeout, connection error
        if attempt == max_attempts - 1:
            break
        delay = base * (2 ** attempt)            # 1, 2, 4, 8, ...
        time.sleep(random.random() * delay)      # full jitter

    return park_in_dlq(event, status)            # exhausted -> dead-letter

Because the worker may POST more than once (a slow endpoint can process a request and still time out before its 2xx reaches us), delivery is at-least-once. The receiver dedups on X-Event-Id to stay idempotent.
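
On the receiving side, that dedup could look like the minimal sketch below. The names `handle_webhook`, `apply_event`, and the in-memory `seen_ids` set are hypothetical and not part of the gateway; a real receiver would likely keep seen ids in durable storage shared across its instances.

```python
seen_ids = set()      # hypothetical store; a database or cache in practice
applied = []          # stands in for the receiver's real side effects


def apply_event(payload):
    """Hypothetical business logic: here it only records the payload."""
    applied.append(payload)


def handle_webhook(headers, payload):
    """Return an HTTP status; a repeated X-Event-Id is acked but not re-applied."""
    event_id = headers["X-Event-Id"]
    if event_id in seen_ids:
        return 200            # duplicate: ack so the sender stops retrying
    apply_event(payload)
    seen_ids.add(event_id)    # mark as seen only after the work succeeded
    return 200


first = handle_webhook({"X-Event-Id": "evt_1"}, b'{"type": "demo"}')
second = handle_webhook({"X-Event-Id": "evt_1"}, b'{"type": "demo"}')
```

Both calls return 200, but the payload is applied only once. Acking the duplicate matters: answering a repeat with an error would only trigger more retries.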

Cost

Property	Effect
Attempts per event	Up to `N` (e.g. 5) before the dead-letter queue
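Total wait before parking	At most `base × (2^(N−1) − 1)` of backoff, because N attempts have only N−1 waits between them: base 1s, N=5 → up to 15s (about half that on average with full jitter), plus the time spent inside the requests themselves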
Storage	The dead-letter queue holds every exhausted or permanently-failed event for replay
Trade-off	Durability and eventual delivery, paid for in delivery latency and duplicate-delivery risk
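
To make the wait arithmetic concrete, this sketch sums the backoff delays for a worker that sleeps only between attempts, as the delivery loop above does. The function name `backoff_caps` is illustrative and not part of the gateway.

```python
def backoff_caps(base, max_attempts):
    """Upper bound of each jittered wait: one wait between consecutive attempts."""
    return [base * (2 ** attempt) for attempt in range(max_attempts - 1)]


caps = backoff_caps(base=1.0, max_attempts=5)
worst_case = sum(caps)        # every jitter draw lands at its cap
expected = worst_case / 2     # full jitter is uniform on [0, cap), so the mean is half
print(caps, worst_case, expected)   # [1.0, 2.0, 4.0, 8.0] 15.0 7.5
```

With five attempts there are four waits, capped at 1, 2, 4, and 8 seconds. Time spent inside each request, up to the per-request timeout, comes on top of that.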

Watch out for

At-least-once means duplicates. A timed-out attempt may have been processed; the retry then delivers it again. Receivers must be idempotent — dedup on the event id and ignore one they've already applied.
Retrying non-retryable 4xx. A 400 or 422 won't fix itself; retrying wastes work and can hammer a misconfigured endpoint. Park permanent failures immediately. (But 408 and 429 are retryable.)
No jitter. If every stuck delivery retries at exactly base × 2^n, they all fire together the instant the endpoint recovers — a thundering herd that knocks it back down. Add full jitter.
Unbounded retries / no dead-letter queue. Retrying forever lets the queue grow without limit and starves healthy deliveries. Cap attempts and park the rest for manual replay.
Unsigned payloads. Without an HMAC signature the receiver can't tell your event from a spoofed one. Sign the body so they can verify before acting.
No per-request timeout. Give each POST its own timeout so one slow consumer doesn't block the queue.
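
For the signature check, a receiver could recompute the HMAC over the raw request body and compare it with the `X-Signature` header, as in this sketch. The helper names `sign` and `verify` and the secret value are made up for illustration; the header format `sha256=<hex digest>` follows the delivery code above.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Build the value the sender attaches as X-Signature."""
    return "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, header_value: str) -> bool:
    """Recompute the signature over the raw body and compare in constant time."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected.encode(), header_value.encode())


secret = b"demo-shared-secret"              # made-up value for illustration
body = b'{"type": "charge.succeeded"}'
good = verify(secret, body, sign(secret, body))
tampered = verify(secret, body + b" ", sign(secret, body))
```

Here `good` is `True` and `tampered` is `False`. Using `hmac.compare_digest` rather than `==` avoids leaking how many leading characters matched through timing differences. The check has to run on the exact bytes received, before any JSON parsing or re-serialization.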

Worked example

A payment system emits charge.succeeded with event id evt_8f31 and the gateway POSTs it to the merchant. The merchant's server is mid-deploy and returns 503 for about six seconds. Attempt 1 fails; the worker waits up to 1s (base 1s, full jitter) and retries, getting 503 again. It waits up to 2s and this time gets a 429 while the new instance warms up. It waits up to 4s, by which point the merchant is back and returns 200 OK on the fourth attempt. Delivered and acked.

Now suppose one of those earlier attempts had instead reached the merchant and been processed, but timed out before its response came back. The successful retry would then be a second delivery of the same event. Because the merchant keys on X-Event-Id: evt_8f31, it sees that it has already applied that id and treats the repeat as a no-op. The payment is recorded once.
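
The sequence of responses in this example can be replayed without a network. The sketch below reuses the retry policy from the worker above but takes the POST and sleep functions as arguments, an arrangement chosen here purely for testing; the scripted responses and the `waits` list are stand-ins, not gateway code.

```python
import random


def deliver(post, sleep, base=1.0, max_attempts=5):
    """Same retry policy as the worker above, with I/O injected for testing."""
    for attempt in range(max_attempts):
        status = post()
        if 200 <= status < 300:
            return "delivered", attempt + 1
        if 400 <= status < 500 and status not in (408, 429):
            return "dead-lettered", attempt + 1
        if attempt < max_attempts - 1:
            sleep(random.random() * base * (2 ** attempt))   # full jitter
    return "dead-lettered", max_attempts


responses = iter([503, 503, 429, 200])    # the merchant's replies, in order
waits = []                                # records each sleep instead of sleeping

outcome, attempts = deliver(lambda: next(responses), waits.append)
print(outcome, attempts, len(waits))      # delivered 4 3
```

The three recorded waits are random but always below their caps of 1, 2, and 4 seconds.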

Check yourself

The endpoint returns 400 Bad Request. Should the gateway keep retrying?