A patient mail carrier: it takes your message off a durable queue, knocks on the provider's door, and if no one answers, waits a little longer each time before knocking again.
People often say "notifications" to mean the whole feed-and-fan-out system. This page is about a narrower, grittier piece: the outbound delivery worker that actually hands each message to email, SMS, or push providers — and refuses to lose it.
The shape is always the same. A durable queue decouples whoever produces a notification from the worker that sends it. The worker dequeues a message, checks the recipient's preferences and quiet hours, dedups against an idempotency key, routes it to the right channel, and calls a third-party provider. Providers fail constantly in small, recoverable ways, so the worker retries with exponential backoff. When something can never succeed, the message goes to a dead-letter queue instead of spinning forever.
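The preference and quiet-hours check in that pipeline can be sketched as below. This is a minimal illustration, not a prescribed design: the `Preferences` fields, the 22:00 to 07:00 default window, and the function names are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import time


@dataclass
class Preferences:
    # Hypothetical per-recipient settings; field names are assumptions.
    opted_out_channels: set = field(default_factory=set)
    quiet_start: time = time(22, 0)  # local time quiet hours begin
    quiet_end: time = time(7, 0)     # local time quiet hours end


def in_quiet_hours(now: time, start: time, end: time) -> bool:
    """True if `now` is in [start, end), including windows that wrap midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def allowed_now(prefs: Preferences, channel: str, now: time) -> bool:
    """Opt-outs always win; otherwise block only during quiet hours."""
    if channel in prefs.opted_out_channels:
        return False
    return not in_quiet_hours(now, prefs.quiet_start, prefs.quiet_end)


prefs = Preferences(opted_out_channels={"sms"})
print(allowed_now(prefs, "email", time(12, 30)))  # True
print(allowed_now(prefs, "email", time(23, 15)))  # False
print(allowed_now(prefs, "sms", time(12, 30)))    # False
```

The wrap-around branch is the part that is easy to get wrong: a window such as 22:00 to 07:00 has `start > end`, so a plain range comparison would never match.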
The worker is a loop. It pulls one message, refuses to send a duplicate (a message whose idempotency key it has already delivered), tries the provider, and on a transient error sleeps for a growing delay before retrying. The backoff is min(cap, base · 2^attempt) plus random jitter, so a thousand workers don't all retry on the same beat. When the attempts run out, or the error is permanent, the message is parked in the dead-letter queue rather than blocking the line.
import random
from time import sleep

# allowed_now, route, PermanentError and TransientError are defined elsewhere.
BASE, CAP, MAX_ATTEMPTS = 1.0, 30.0, 5  # BASE and CAP are in seconds

def worker_loop(queue, dlq, store):
    while (msg := queue.dequeue()) is not None:
        # 1. dedup: have we already delivered this exact key?
        if store.already_delivered(msg.idempotency_key):
            queue.ack(msg)  # safe no-op, drop it
            continue
        # 2. respect the recipient before spending a provider call
        if not allowed_now(msg.recipient, msg.channel):  # opt-out / quiet hours
            queue.ack(msg)
            continue
        for attempt in range(MAX_ATTEMPTS):
            try:
                provider = route(msg.channel)  # email / sms / push
                provider.send(msg, idempotency_key=msg.idempotency_key)
                store.mark_delivered(msg.idempotency_key)  # record before ack
                queue.ack(msg)  # remove from queue
                break
            except PermanentError:  # bad number, opted out, non-retryable 4xx
                dlq.push(msg, reason="permanent")
                queue.ack(msg)
                break
            except TransientError:  # 429 / 503 / timeout
                if attempt == MAX_ATTEMPTS - 1:
                    dlq.push(msg, reason="retries_exhausted")
                    queue.ack(msg)
                    break
                delay = min(CAP, BASE * 2 ** attempt)
                delay += random.uniform(0, delay)  # additive jitter: wait delay..2*delay
                sleep(delay)  # then retry
Note the order: send, mark delivered, then ack. If the worker crashes after sending but before acking, the message reappears and is retried. If the delivery was already recorded, the dedup check drops the retry; if the crash came before the record was written, the provider's own idempotency key turns the re-send into a harmless no-op. Acking before the send would silently drop messages on a crash.
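To make the crash window concrete, here is a small in-memory simulation. It is only a sketch: `FakeProvider`, `DeliveryStore`, `handle`, and the `crash_before_mark` flag are hypothetical stand-ins, and the provider is assumed to honor idempotency keys.

```python
class FakeProvider:
    """Stand-in for a provider that honors idempotency keys (an assumption)."""

    def __init__(self):
        self.seen_keys = set()
        self.deliveries = 0  # messages the user would actually receive

    def send(self, body, idempotency_key):
        if idempotency_key in self.seen_keys:
            return "duplicate_ignored"
        self.seen_keys.add(idempotency_key)
        self.deliveries += 1
        return "accepted"


class DeliveryStore:
    """Stand-in for the worker's own per-message delivery record."""

    def __init__(self):
        self.delivered = set()

    def already_delivered(self, key):
        return key in self.delivered

    def mark_delivered(self, key):
        self.delivered.add(key)


def handle(msg, provider, store, crash_before_mark=False):
    """Process one dequeue; return True if the message was acked."""
    key, body = msg
    if store.already_delivered(key):
        return True   # dedup hit: ack without sending
    provider.send(body, idempotency_key=key)
    if crash_before_mark:
        return False  # simulated crash: no record written, no ack
    store.mark_delivered(key)
    return True


provider, store = FakeProvider(), DeliveryStore()
msg = ("ord-91:shipped", "Your order shipped")

acked = handle(msg, provider, store, crash_before_mark=True)  # first try crashes
if not acked:
    acked = handle(msg, provider, store)                      # queue redelivers

print(acked, provider.deliveries)  # True 1
```

The message is sent twice but delivered once. If the provider did not deduplicate, the second send would reach the user, which is why the worker's own record alone does not close this window.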
| Choice | You get | You pay |
|---|---|---|
| At-least-once delivery | Never silently drops a message; simple to reason about | Duplicates are possible — needs idempotency keys to stay safe |
| Exactly-once delivery | No duplicates, ever | Effectively unreachable across a third-party boundary; you approximate it with at-least-once + dedup |
| Durable queue (fsync, replicas) | Survives crashes; no lost messages | Higher enqueue latency and storage cost per message |
| Aggressive retries | Rides out brief provider blips | Retry storms can amplify an outage; needs jitter and a ceiling |
| Giving up early (small max-attempts) | Frees the worker fast; bounded blast radius | Drops messages a longer retry would have delivered |
| Per-provider rate limits | Stays inside the provider's quota; fewer 429s | Caps throughput; bursts must buffer in the queue |
| Per-message delivery state | You can answer "did it send?" and dedup correctly | A row (or key) per message — real storage at scale |
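The per-provider rate limit row is often implemented as a token bucket. The sketch below is one possible shape, not a reference implementation: the `TokenBucket` class, its parameters, and the rate of 5 per second are assumptions, and the injectable clock exists only to make the example deterministic.

```python
import time


class TokenBucket:
    """Allows `rate` sends per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def try_acquire(self):
        """Take one token if available; return False so the caller can wait."""
        now = self.clock()
        elapsed = now - self.updated
        # Refill for the time that has passed, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# A fake clock keeps the demonstration deterministic; the limits are made up.
fake_now = [0.0]
bucket = TokenBucket(rate=5, capacity=10, clock=lambda: fake_now[0])

print(sum(bucket.try_acquire() for _ in range(25)))  # 10
fake_now[0] += 1.0  # one second later, 5 tokens have refilled
print(sum(bucket.try_acquire() for _ in range(25)))  # 5
```

A worker would keep one bucket per provider and leave a message in the queue (or wait briefly) whenever `try_acquire()` returns False, which is the buffering cost the table describes.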
A 400 invalid phone number or an opted-out recipient will never succeed. Retrying it five times wastes quota and delays everything behind it. Classify errors and dead-letter permanent ones immediately.
If every worker waits exactly base · 2^attempt, they all retry in lockstep and hammer a recovering provider in synchronized waves. Add randomness so the herd spreads out.
Sending faster than a provider's quota allows earns you 429s, then throttling, then a worse outage than you started with. Track tokens per provider and pace sends.
Message A — transient 503, then delivered. An order-shipped email is enqueued (key=ord-91:shipped). The worker dequeues it, sees the recipient hasn't opted out and isn't in quiet hours, routes to the email provider, and calls send.
Attempt 1 returns 503 service unavailable — transient. With base=1s, the delay is min(30, 1·2^0)=1s, plus jitter, so it waits roughly 1–2s and retries. Attempt 2 returns 503 again; delay is min(30, 1·2^1)=2s plus jitter (~2–4s). Attempt 3 returns 202 accepted — delivered. Because the same idempotency key rode along on every attempt, even if attempt 1 had actually sent before the connection dropped, the provider would have deduped it. Delivered count +1, message acked.
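The delays in this walkthrough can be reproduced with a few lines. This sketch uses the same constants as the example (base 1s, cap 30s, 5 attempts); the helper names `backoff_bounds` and `backoff_delay` are made up for illustration.

```python
import random

BASE, CAP, MAX_ATTEMPTS = 1.0, 30.0, 5  # seconds, matching the worked example


def backoff_bounds(attempt, base=BASE, cap=CAP):
    """Return (min_wait, max_wait) in seconds for a 0-indexed attempt."""
    delay = min(cap, base * 2 ** attempt)
    return delay, 2 * delay  # jitter adds between 0 and `delay` on top


def backoff_delay(attempt, rng=random):
    """Draw one concrete wait from the range above."""
    low, _ = backoff_bounds(attempt)
    return low + rng.uniform(0, low)


for attempt in range(MAX_ATTEMPTS - 1):  # no sleep after the final attempt
    low, high = backoff_bounds(attempt)
    print(f"after attempt {attempt + 1}: wait {low:g}-{high:g}s")
# after attempt 1: wait 1-2s
# after attempt 2: wait 2-4s
# after attempt 3: wait 4-8s
# after attempt 4: wait 8-16s
```

Two things fall out of the numbers. With five attempts the 30s cap is never reached, since the largest un-jittered delay is 8s. And because the jitter is added after the cap is applied, a longer retry budget could wait up to twice the cap.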
Message B — dead-lettered. An SMS to +1-555-0000 is enqueued. The worker routes to the SMS provider, which returns 400 invalid number — a permanent error. There is no point retrying a number that can't exist, so the worker pushes the message to the dead-letter queue with reason="permanent" and acks it. The DLQ count goes to 1; an operator (or a repair job) can inspect it later, but the live worker moves on instead of looping forever.
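The difference between Message A and Message B comes down to how the provider's response is classified. A minimal sketch follows, assuming an HTTP-style status code. The set of retryable statuses is an assumption for illustration; real providers document their own codes, and treating unknown errors as permanent is a design choice rather than a rule.

```python
class PermanentError(Exception):
    """Retrying cannot help: bad address, opted out, malformed request."""


class TransientError(Exception):
    """Retrying may help: rate limited, provider overloaded, timed out."""


# Assumed mapping; check each provider's documentation before relying on it.
TRANSIENT_STATUSES = {408, 429, 500, 502, 503, 504}


def raise_for_status(status: int) -> None:
    """Translate a provider status into one of the worker's two error types."""
    if 200 <= status < 300:
        return
    if status in TRANSIENT_STATUSES:
        raise TransientError(f"provider returned {status}")
    raise PermanentError(f"provider returned {status}")


def classify(status: int) -> str:
    """Name the action the worker would take for this status."""
    try:
        raise_for_status(status)
    except TransientError:
        return "retry"
    except PermanentError:
        return "dead-letter"
    return "delivered"


for status in (202, 400, 429, 503):
    print(status, classify(status))
# 202 delivered
# 400 dead-letter
# 429 retry
# 503 retry
```

Note that 429 is in the 400 range but is retryable, so classifying by the leading digit alone would dead-letter messages that only needed to wait.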
1. Your worker delivers at-least-once and you notice some users get the same push twice. What's the fix that keeps delivery reliable?
2. A provider returns 400 invalid phone number. How should the worker treat it?