Question
To fix a bad earlier computation, a team reset a consumer group's offsets to replay 30 days of a high-volume `transactions` topic through the enrichment pipeline. Twenty minutes later, three downstream services are on fire: the enrichment service is fine, but the notification service it calls is emitting 30 days of 'transaction processed' push notifications to real users in real time, a third-party fraud API is being rate-limited (429s) and shedding live traffic, and an audit topic is filling 30x faster than disk reclaims. The replay is ~10% done. How do you triage, stop the harmful side-effects, and complete the reprocessing safely?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.