Code Room
System design · Hard
Question
Design dead-letter and replay tooling for a large event platform. When a consumer fails to process a message N times, the message goes to a dead-letter queue (DLQ). Operators need to: inspect DLQ messages, see the failure reason and original headers, edit or fix a payload, and replay one message, a filtered subset, or an entire DLQ back into the original topic. Replay must not re-trigger downstream side effects that already succeeded, and must not cause an infinite poison loop. Scale: 10K DLQ messages/min during a bad deploy, with replay batches of millions of messages.
What a strong answer looks like
Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.
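As one illustration of the "data model" and "failure modes" points, here is a minimal Python sketch of a DLQ record and a replay guard. It is not a reference answer. Every name in it (`DeadLetter`, `ReplayMessage`, `prepare_replay`, `SideEffectLedger`, `MAX_REPLAYS`, and the `x-idempotency-key` and `x-replay-count` headers) is hypothetical, and the in-memory set stands in for what would need to be a durable store in a real system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Set, Tuple

MAX_REPLAYS = 3  # hypothetical cap on replays per message (poison-loop guard)


@dataclass(frozen=True)
class DeadLetter:
    """One DLQ record: the failed message plus what operators need to see."""
    message_id: str
    original_topic: str
    payload: bytes
    original_headers: Dict[str, str]
    failure_reason: str
    # Assumed to be restored from the x-replay-count header when a replayed
    # message fails again and lands back in the DLQ.
    replay_count: int = 0


@dataclass(frozen=True)
class ReplayMessage:
    topic: str
    payload: bytes
    headers: Dict[str, str]


def prepare_replay(
    dead: DeadLetter, edited_payload: Optional[bytes] = None
) -> Optional[ReplayMessage]:
    """Build the message to re-publish, or None if the replay cap is reached."""
    if dead.replay_count >= MAX_REPLAYS:
        return None  # leave it parked in the DLQ instead of looping forever
    headers = dict(dead.original_headers)  # copy; keep the original intact
    headers["x-idempotency-key"] = dead.message_id
    headers["x-replay-count"] = str(dead.replay_count + 1)
    payload = dead.payload if edited_payload is None else edited_payload
    return ReplayMessage(dead.original_topic, payload, headers)


class SideEffectLedger:
    """Consumer-side record of side effects that already succeeded."""

    def __init__(self) -> None:
        self._done: Set[Tuple[str, str]] = set()

    def run_once(self, key: str, step: str, effect: Callable[[], None]) -> bool:
        """Run effect unless (key, step) already succeeded; return True if run."""
        if (key, step) in self._done:
            return False
        effect()
        self._done.add((key, step))  # recorded only after the effect succeeds
        return True
```

The trade-off to name in an answer: a ledger keyed by idempotency key and step lets a replayed message skip steps that already succeeded, but it costs a lookup per side effect, and the gap between running the effect and recording it still allows a duplicate if the consumer crashes in between.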