The notification sender

A patient mail carrier: it takes your message off a durable queue, knocks on the provider's door, and if no one answers, waits a little longer each time before knocking again.

The idea

People often say "notifications" to mean the whole feed-and-fan-out system. This page is about a narrower, grittier piece: the outbound delivery worker that actually hands each message to email, SMS, or push providers — and refuses to lose it.

The shape is always the same. A durable queue decouples whoever produces a notification from the worker that sends it. The worker dequeues a message, deduplicates it against an idempotency key, checks the recipient's preferences and quiet hours, routes it to the right channel, and calls a third-party provider. Providers fail constantly in small, recoverable ways, so the worker retries with exponential backoff. When something can never succeed, the message goes to a dead-letter queue instead of spinning forever.

How it works

The worker is a loop. It pulls one message, skips it if its idempotency key has already been delivered, tries the provider, and on a transient error sleeps for a growing delay before retrying. The delay is min(cap, base · 2^attempt) plus random jitter of up to the same amount again, so a thousand workers don't all retry on the same beat. When the attempts run out, or the error is permanent, the message is parked in the dead-letter queue rather than blocking the line.

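import random
from time import sleep

# Sketch: queue, dlq, store, route, allowed_now, PermanentError and
# TransientError are assumed to be defined elsewhere.
BASE, CAP, MAX_ATTEMPTS = 1.0, 30.0, 5   # BASE and CAP in seconds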

def worker_loop(queue, dlq, store):
    while (msg := queue.dequeue()) is not None:
        # 1. dedup: have we already delivered this exact key?
        if store.already_delivered(msg.idempotency_key):
            queue.ack(msg)                 # safe no-op, drop it
            continue

        # 2. respect the recipient before spending a provider call
        if not allowed_now(msg.recipient, msg.channel):   # opt-out / quiet hours
            queue.ack(msg)
            continue

        for attempt in range(MAX_ATTEMPTS):
            try:
                provider = route(msg.channel)              # email / sms / push
                provider.send(msg, idempotency_key=msg.idempotency_key)
                store.mark_delivered(msg.idempotency_key)  # record before ack
                queue.ack(msg)                             # remove from queue
                break
            except PermanentError:        # bad number, opted out, non-retryable 4xx
                dlq.push(msg, reason="permanent")
                queue.ack(msg)
                break
            except TransientError:        # 429 / 503 / timeout
                if attempt == MAX_ATTEMPTS - 1:
                    dlq.push(msg, reason="retries_exhausted")
                    queue.ack(msg)
                    break
                delay = min(CAP, BASE * 2 ** attempt)
                delay += random.uniform(0, delay)          # additive jitter: delay..2*delay
                sleep(delay)                               # then retry

Note the order: send, mark delivered, then ack. If the worker crashes after marking but before acking, the message reappears and the dedup check turns the redelivery into a harmless no-op. If it crashes after sending but before marking, the retry does reach the provider again, and the provider's own idempotency key is what suppresses the duplicate. Acking before the send would silently drop the message on a crash in between.
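The crash-recovery case can be checked with a toy, in-memory simulation. This is only a sketch: InMemoryQueue, process, and the crash_before_ack flag are hypothetical stand-ins, not part of any real queue client, and appending to sent_log stands in for the provider call.

```python
class InMemoryQueue:
    """Toy at-least-once queue: a message stays visible until it is acked."""

    def __init__(self, keys):
        self.pending = list(keys)

    def peek(self):
        return self.pending[0] if self.pending else None

    def ack(self, key):
        self.pending.remove(key)


def process(queue, delivered, sent_log, crash_before_ack=False):
    """Handle one message; optionally simulate a crash after recording delivery."""
    key = queue.peek()
    if key is None:
        return
    if key in delivered:          # dedup check: already recorded as delivered
        queue.ack(key)
        return
    sent_log.append(key)          # stands in for provider.send(...)
    delivered.add(key)            # mark delivered first
    if crash_before_ack:
        return                    # simulated crash: the ack never happens
    queue.ack(key)                # ack only after delivery is recorded


queue = InMemoryQueue(["ord-91:shipped"])
delivered, sent_log = set(), []

process(queue, delivered, sent_log, crash_before_ack=True)  # message stays queued
process(queue, delivered, sent_log)                         # redelivery is deduped

print(sent_log)        # ['ord-91:shipped']
print(queue.pending)   # []
```

The message is sent once even though it was processed twice. A crash between the send and the mark is not covered by this local check; that window relies on the provider honoring the idempotency key.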

Cost / trade-offs

Choice	You get	You pay
At-least-once delivery	Never silently drops a message; simple to reason about	Duplicates are possible — needs idempotency keys to stay safe
Exactly-once delivery	No duplicates, ever	Effectively unreachable across a third-party boundary; you approximate it with at-least-once + dedup
Durable queue (fsync, replicas)	Survives crashes; no lost messages	Higher enqueue latency and storage cost per message
Aggressive retries	Rides out brief provider blips	Retry storms can amplify an outage; needs jitter and a ceiling
Giving up early (small max-attempts)	Frees the worker fast; bounded blast radius	Drops messages a longer retry would have delivered
Per-provider rate limits	Stays inside the provider's quota; fewer 429s	Caps throughput; bursts must buffer in the queue
Per-message delivery state	You can answer "did it send?" and dedup correctly	A row (or key) per message — real storage at scale
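One common way to pay the per-provider rate-limit cost in the table is a token bucket. The sketch below is illustrative only: the TokenBucket class, its parameter names, and the 2-sends-per-second figure are assumptions, not any provider's real quota. A fake clock keeps the demo deterministic.

```python
import time


class TokenBucket:
    """Allow up to `rate` sends per second, with bursts of at most `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def try_acquire(self):
        """Take one token if available; return False so the caller can wait."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Deterministic demo with a fake clock (hypothetical: 2 sends/s, burst of 2).
fake_now = [0.0]
bucket = TokenBucket(rate=2, capacity=2, clock=lambda: fake_now[0])

burst = [bucket.try_acquire() for _ in range(3)]   # third call exceeds the burst
fake_now[0] += 0.5                                 # half a second refills one token
after_wait = bucket.try_acquire()

print(burst)        # [True, True, False]
print(after_wait)   # True
```

When try_acquire returns False, the worker would leave the message in the queue (or sleep briefly) rather than call the provider, which is the "bursts must buffer in the queue" cost from the table.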

Watch out for

No idempotency key. At-least-once delivery will re-deliver on retry or crash recovery. Without a key the recipient gets the same text twice. Generate the key when the notification is created, not when you send.
Retrying permanent failures. A 400 invalid phone number or an opted-out recipient will never succeed. Retrying it five times wastes quota and delays everything behind it — classify errors and dead-letter permanent ones immediately.
Backoff without jitter. If every worker uses the exact same base · 2^attempt, they all retry in lockstep and hammer a recovering provider in synchronized waves. Add randomness so the herd spreads out.
Ignoring provider rate limits. Blasting past the provider's quota earns you 429s, then throttling, then a worse outage than you started with. Track tokens per provider and pace sends.
Acking before the send. Remove the message from the queue only after delivery is recorded. Ack-then-send loses messages on any crash in between.
Quiet-hours and timezone bugs. "10pm" is meaningless without the recipient's timezone. Compute quiet hours in their local time, or you'll wake people at 3am.
Unbounded retries. A message that never succeeds and never gives up blocks the worker (or pins a partition). Always cap attempts and dead-letter on exhaustion.
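The quiet-hours pitfall above can be made concrete with a small check that converts the current UTC time into the recipient's zone before comparing. This is a sketch under assumptions: in_quiet_hours and the 22:00 to 08:00 window are hypothetical, and the demo uses fixed UTC offsets to stay self-contained. A real system would more likely pass zoneinfo.ZoneInfo("America/New_York") or similar so daylight saving time is handled.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def in_quiet_hours(now_utc: datetime, tz: tzinfo,
                   start: time = time(22, 0), end: time = time(8, 0)) -> bool:
    """Return True if `now_utc` falls inside the recipient's local quiet window."""
    local = now_utc.astimezone(tz).time()
    if start <= end:                       # window within one day, e.g. 13:00-15:00
        return start <= local < end
    return local >= start or local < end   # window wraps past midnight


# Fixed offsets for the demo only (assumption: no DST in effect).
new_york_winter = timezone(timedelta(hours=-5))
tokyo = timezone(timedelta(hours=9))

# 03:00 UTC is 22:00 the previous evening in New York and 12:00 in Tokyo.
now = datetime(2024, 1, 15, 3, 0, tzinfo=timezone.utc)
print(in_quiet_hours(now, new_york_winter))  # True
print(in_quiet_hours(now, tokyo))            # False
```

The same instant is inside the quiet window for one recipient and outside it for another, which is why the comparison has to happen in local time.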

Worked example

Message A — transient 503, then delivered. An order-shipped email is enqueued (key=ord-91:shipped). The worker dequeues it, sees the recipient hasn't opted out and isn't in quiet hours, routes to the email provider, and calls send.

Attempt 1 returns 503 service unavailable — transient. With base=1s, the delay is min(30, 1·2^0)=1s, plus jitter, so it waits roughly 1–2s and retries. Attempt 2 returns 503 again; delay is min(30, 1·2^1)=2s plus jitter (~2–4s). Attempt 3 returns 202 accepted — delivered. Because the same idempotency key rode along on every attempt, even if attempt 1 had actually sent before the connection dropped, the provider would have deduped it. Delivered count +1, message acked.
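The delay arithmetic above can be reproduced with a few lines. This sketch assumes the same BASE and CAP as the worker loop and the same additive jitter (a random amount between 0 and the capped delay); backoff_bounds is a hypothetical helper that returns the range instead of sleeping.

```python
BASE, CAP = 1.0, 30.0   # seconds, same values as the worker loop


def backoff_bounds(attempt, base=BASE, cap=CAP):
    """Return (min, max) sleep in seconds before the retry that follows `attempt`."""
    delay = min(cap, base * 2 ** attempt)
    # The loop adds random.uniform(0, delay), so the sleep lies in [delay, 2 * delay].
    return delay, 2 * delay


for attempt in range(4):
    low, high = backoff_bounds(attempt)
    print(f"after attempt {attempt + 1}: wait {low:.0f}-{high:.0f}s")
```

Note that with additive jitter the actual sleep can reach twice the cap (up to 60s here); if the cap must be a hard ceiling, the jitter would need to be applied before the min rather than after.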

Message B — dead-lettered. An SMS to +1-555-0000 is enqueued. The worker routes to the SMS provider, which returns 400 invalid number — a permanent error. There is no point retrying a number that can't exist, so the worker pushes the message to the dead-letter queue with reason="permanent" and acks it. The DLQ count goes to 1; an operator (or a repair job) can inspect it later, but the live worker moves on instead of looping forever.
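The difference between Message A and Message B comes down to how the worker classifies the provider's response. A minimal sketch, assuming the provider reports outcomes as HTTP status codes; the classify function and the exact set of retryable codes are assumptions, and real providers document their own.

```python
# Assumed retryable statuses: timeouts, rate limiting, and common 5xx errors.
TRANSIENT_STATUSES = {408, 429, 500, 502, 503, 504}


def classify(status):
    """Map an HTTP status to 'delivered', 'transient', or 'permanent'."""
    if 200 <= status < 300:
        return "delivered"
    if status in TRANSIENT_STATUSES:
        return "transient"     # worth retrying with backoff
    return "permanent"         # dead-letter immediately


print(classify(503))  # transient
print(classify(202))  # delivered
print(classify(400))  # permanent
```

Note that 429 is a 4xx code but is treated as transient here, so "retry 5xx, dead-letter 4xx" is too coarse a rule.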

Check yourself

1. Your worker delivers at-least-once and you notice some users get the same push twice. What's the fix that keeps delivery reliable?

2. A provider returns 400 invalid phone number. How should the worker treat it?