On-call · Hard · oc-g122

Subject: Backfill storm · Level: Senior–Staff · ~40 min · Common in: Distributed systems interviews · Industries: Technology, Software development

Question

To fix a bad earlier computation, a team reset a consumer group's offsets to replay 30 days of a high-volume `transactions` topic through the enrichment pipeline. Twenty minutes later, three downstream dependencies are on fire. The enrichment service itself is fine, but the notification service it calls is sending 30 days of 'transaction processed' push notifications to real users in real time, a third-party fraud API is returning rate-limit errors (429s) and shedding live traffic, and an audit topic is filling 30x faster than retention can reclaim disk. The replay is ~10% done. How do you triage, stop the harmful side effects, and complete the reprocessing safely?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
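One way to make "stop the bleeding, then finish safely" concrete is a replay-aware gate in front of each side effect. The sketch below is illustrative only and is not taken from the scenario: the `replay` header, the `audit.replay` topic name, the 300-second lag threshold, and the helper names are all hypothetical assumptions. It shows replayed records being detected, user notifications suppressed, fraud API calls capped by a small token bucket so live traffic keeps its quota, and audit writes routed to a separate topic.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Record:
    key: str
    event_time: float                      # epoch seconds of the original transaction
    headers: dict = field(default_factory=dict)


class TokenBucket:
    """Caps how many replay calls may hit a shared dependency per second."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def try_acquire(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def is_replay(record, replay_started_at, max_live_lag_sec=300):
    """Assumption: a record is 'replayed' if it carries a (hypothetical)
    replay header, or if its event time is much older than the replay start."""
    if record.headers.get("replay") == "true":
        return True
    return record.event_time < replay_started_at - max_live_lag_sec


def plan_side_effects(record, replay_started_at, fraud_bucket):
    """Decide which side effects are allowed for this record."""
    if not is_replay(record, replay_started_at):
        # Live traffic keeps its normal behavior and never spends replay budget.
        return {"notify": True, "fraud_check": "call", "audit_topic": "audit"}
    return {
        "notify": False,                                   # never re-notify real users
        "fraud_check": "call" if fraud_bucket.try_acquire() else "defer",
        "audit_topic": "audit.replay",                     # hypothetical separate topic
    }
```

In this sketch a deferred fraud check would be retried later by the caller rather than dropped. The time-based heuristic is only a fallback: an explicit flag set by the replay job is more reliable, because a lagging live consumer can also see old records.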


Diagram & narrate the incident

