Question
Your payments side-effects pipeline uses the Kafka tiered-retry pattern: `payments` (main) → on handler failure, produce to `payments.retry.5s` → consumer waits and re-attempts → on failure produce to `payments.retry.30s` → on failure produce to `payments.DLT` (dead-letter topic). At 08:30 an alert fires: `payments.DLT` lag/size is climbing ~600 msg/min, but the *main* `payments` topic looks healthy (normal lag, normal error rate). Dashboards: the retry-topic consumers show a steady redelivery flow, and the DLT producer rate exactly tracks the retry-30s failure rate. Recent context: a downstream `ledger` API started returning HTTP 429 (rate-limited) at 08:25 after a quota change. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.