Question
v210 of the `notifications` consumer is rolled back at 16:00 after a latency regression. The rollback is rolling, and the deploy tool reports 'complete, 100% on v209' at 16:05. But a subset of notifications keep getting dropped silently for another 30 minutes. Context: v210 had introduced a new compaction format for the per-user dedup state it stores in a shared Redis hash, and it had been writing that new format for the ~25 minutes it was live. v209's reader expects the old format and treats unparseable dedup entries as 'already sent,' so it skips them. Dashboards: consumer lag is zero, no errors thrown (the skip is silent), Redis is healthy. Triage and explain why a 'complete' rollback didn't stop the bleeding, then mitigate.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.