On-call · Hard · oc-g466

Subject: Rollback · Level: Senior–Staff · ~40 min · Common in Reliability & on-call interviews · Industries: Technology, Software development

Question

v210 of the `notifications` consumer is rolled back at 16:00 after a latency regression. The rollback is rolling, and the deploy tool reports 'complete, 100% on v209' at 16:05. But a subset of notifications keeps getting dropped silently for another 30 minutes. Context: v210 had introduced a new compaction format for the per-user dedup state it stores in a shared Redis hash, and it had been writing that new format for the ~25 minutes it was live. v209's reader expects the old format and treats unparseable dedup entries as 'already sent,' so it skips them. Dashboards: consumer lag is zero, no errors are thrown (the skip is silent), and Redis is healthy. Triage and explain why a 'complete' rollback didn't stop the bleeding, then mitigate.
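The silent skip described above can be sketched as follows. This is a minimal illustration, not the actual consumer code: the hash layout (one field per user), the comma-separated old format, the `v2:` marker for the new format, and the function names are all assumptions made for the example, and a plain dict stands in for the shared Redis hash.

```python
from typing import Dict, Optional, Set


def parse_old_format(raw: str) -> Optional[Set[str]]:
    """Hypothetical v209 parser.

    Assumed old format: comma-separated notification IDs, e.g. "n1,n2".
    Assumed new format: anything starting with "v2:", which v209 cannot read.
    """
    if raw.startswith("v2:"):
        return None  # unparseable for the old reader
    return {part for part in raw.split(",") if part}


def should_send_v209(dedup_hash: Dict[str, str], user_id: str, notification_id: str) -> bool:
    """Return True if v209 would deliver this notification."""
    raw = dedup_hash.get(user_id)
    if raw is None:
        return True  # no dedup state for this user: send
    sent = parse_old_format(raw)
    if sent is None:
        # The behavior from the question: an unparseable entry is treated as
        # "already sent", so the notification is skipped with no error raised.
        return False
    return notification_id not in sent


# "bob" was last written by v210 (assumed new format); "alice" was not.
dedup_hash = {"alice": "n1,n2", "bob": "v2:abc"}
print(should_send_v209(dedup_hash, "alice", "n3"))  # True
print(should_send_v209(dedup_hash, "bob", "n9"))    # False
```

In this sketch the binary version is irrelevant once the data is written: every user whose entry v210 touched stays unreadable to v209 until the entry is rewritten, removed, or expires.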

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
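One way to make "stop the bleeding first" concrete for this scenario is to move the entries v209 cannot parse out of its read path, so affected users fall back to "no dedup state" and risk a duplicate notification rather than a silent drop. The sketch below is one possible approach under stated assumptions, not a prescribed fix: the `v2:` prefix, the one-field-per-user layout, and the function names are hypothetical, and plain dicts stand in for the Redis hash and a side key. Against a real Redis hash the same idea would iterate incrementally (for example with `HSCAN`) and remove fields with `HDEL`.

```python
from typing import Dict, List

NEW_FORMAT_PREFIX = "v2:"  # hypothetical marker for entries written by v210


def find_unparseable_users(dedup_hash: Dict[str, str]) -> List[str]:
    """List users whose dedup entry is in the (assumed) new format."""
    return sorted(
        user_id
        for user_id, raw in dedup_hash.items()
        if raw.startswith(NEW_FORMAT_PREFIX)
    )


def quarantine_unparseable(dedup_hash: Dict[str, str], quarantine: Dict[str, str]) -> List[str]:
    """Move unparseable entries to a side store instead of deleting them.

    Keeping the raw values preserves evidence for the postmortem and allows
    a later conversion back to the old format.
    """
    moved = find_unparseable_users(dedup_hash)  # computed first, so popping below is safe
    for user_id in moved:
        quarantine[user_id] = dedup_hash.pop(user_id)
    return moved


live = {"alice": "n1,n2", "bob": "v2:abc", "carol": "v2:def"}
side: Dict[str, str] = {}
moved = quarantine_unparseable(live, side)
print(moved)  # ['bob', 'carol']
```

The length of the returned list also doubles as a real signal for the hypothesis: if it is non-zero after the rollback reports complete, the drops are coming from persisted state rather than from any instance still running v210.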
