On-call · Hard · oc-g377

Subject: Dead letter overflow · Level: Senior–Staff · ~40 min · Common in: Reliability & on-call interviews · Industries: Technology

Question

Your payments side-effects pipeline uses the Kafka tiered-retry pattern: `payments` (main) → on handler failure, produce to `payments.retry.5s` → consumer waits and re-attempts → on failure produce to `payments.retry.30s` → on failure produce to `payments.DLT` (dead-letter topic). At 08:30 an alert fires: `payments.DLT` lag/size is climbing ~600 msg/min, but the *main* `payments` topic looks healthy (normal lag, normal error rate). Dashboards: the retry-topic consumers show a steady redelivery flow, and the DLT producer rate exactly tracks the retry-30s failure rate. Recent context: a downstream `ledger` API started returning HTTP 429 (rate-limited) at 08:25 after a quota change. How do you triage and mitigate?
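To make the topic chain concrete, here is a minimal sketch of the routing described in the question, written as plain Python with no Kafka client. The topic names come from the question; `route_after_attempt`, `final_destination`, and the two toy handlers are hypothetical names introduced only for illustration. It shows one property worth keeping in mind during triage: the two retry tiers add up to roughly 35 seconds of delay, so a dependency that keeps failing for longer than that sends every affected message to `payments.DLT`.

```python
from typing import Callable, Optional

# Topic chain taken from the question. Everything else here is illustrative.
NEXT_TOPIC = {
    "payments": "payments.retry.5s",
    "payments.retry.5s": "payments.retry.30s",
    "payments.retry.30s": "payments.DLT",
}


def route_after_attempt(topic: str, succeeded: bool) -> Optional[str]:
    """Return the topic to produce to next, or None when the message is done."""
    if succeeded:
        return None
    return NEXT_TOPIC[topic]


def final_destination(handler: Callable[[str], bool]) -> str:
    """Walk one message through the tiers and report where it ends up."""
    topic = "payments"
    while topic != "payments.DLT":
        next_topic = route_after_attempt(topic, handler(topic))
        if next_topic is None:
            return f"done@{topic}"
        topic = next_topic
    return topic


def always_fails(topic: str) -> bool:
    # Models a dependency that stays unavailable across all three attempts.
    return False


def fails_once(topic: str) -> bool:
    # Models a transient blip: only the first attempt (main topic) fails.
    return topic != "payments"
```

With these toy handlers, `final_destination(always_fails)` ends at `payments.DLT`, while `final_destination(fails_once)` completes on the first retry tier. The sketch ignores the actual waiting and any per-message state; it only models which topic a message lands on.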

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
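As one possible illustration of "stop the bleeding", the sketch below separates a dependency-wide condition (HTTP 429 from `ledger`) from a genuinely bad message. It is an assumption about how a handler could be structured, not the pipeline's actual code: `Decision`, `decide`, and `default_pause` are hypothetical, and `retry_after` is assumed to be the `Retry-After` value already parsed into seconds. The idea is that a rate-limited response pauses consumption instead of advancing the message toward the DLT.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    action: str                 # "ack", "pause", or "forward"
    delay_seconds: float = 0.0  # only meaningful for "pause"


def decide(status_code: int,
           retry_after: Optional[float],
           default_pause: float = 30.0) -> Decision:
    """Classify a ledger response into a consumer action (hypothetical policy)."""
    if 200 <= status_code < 300:
        # Side effect applied; commit the offset.
        return Decision("ack")
    if status_code == 429:
        # Rate limiting affects every message, so burning retry tiers does not
        # help. Pause and re-attempt the same message after the suggested delay.
        delay = retry_after if retry_after is not None else default_pause
        return Decision("pause", delay)
    # Any other failure follows the existing tiered-retry path.
    return Decision("forward")
```

For example, `decide(429, 12.0)` yields a pause of 12 seconds, and `decide(500, None)` forwards to the next tier as before. Whether pausing is acceptable depends on ordering and latency requirements that the question does not specify, and messages already in `payments.DLT` would still need a separate replay once the quota issue is resolved.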


Diagram & narrate the incident

