On-call | Hard | oc-g276

Subject: Version skew
Level: Senior–Staff (~40 min)
Common in: Reliability & on-call interviews
Industries: Technology, Software development

Question

During a 30-minute rolling deploy of the `profile` service, users intermittently see corrupted or stale profile data — sometimes another user's field, sometimes a missing avatar — at roughly the rate matching the fraction of pods on the new version, and it persists for a while *after* the deploy completes. Dashboards: shared Redis cache hit-rate is normal, no errors, no latency change. The new version changed the cached value's serialization from JSON to a packed binary format and kept the same cache keys; old pods write/read JSON under those keys, new pods write/read binary under the same keys, against a *shared* Redis. How do you triage and mitigate, and why does it persist after the deploy?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
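As a concrete illustration of the failure and of one common prevention, here is a minimal sketch. Everything in it is an assumption made for the example: the packed layout, the function names, the `v1`/`v2` key tags, and a plain dict standing in for the shared Redis. It shows a new-version decoder reading an old JSON entry under a shared key without raising, and then the same read with the format tag in the key, where the old entry is simply a miss. It also hints at why the symptom can outlive the deploy: entries written in the old format stay in the cache until they expire or are overwritten.

```python
import json
import struct

# Hypothetical packed layout for illustration: 4-byte little-endian user id,
# then a 16-byte NUL-padded name field.
PACKED = struct.Struct("<I16s")


def encode_json(profile: dict) -> bytes:
    """Old-version encoding: JSON text."""
    return json.dumps(profile, sort_keys=True).encode("utf-8")


def encode_packed(profile: dict) -> bytes:
    """New-version encoding: fixed-width binary."""
    return PACKED.pack(profile["user_id"], profile["name"].encode("utf-8"))


def decode_packed(raw: bytes) -> dict:
    """New-version decoding. It does no validation, so any buffer of at
    least 20 bytes decodes without raising, whatever wrote it."""
    user_id, name = PACKED.unpack_from(raw)
    return {"user_id": user_id,
            "name": name.rstrip(b"\x00").decode("utf-8", "replace")}


def shared_key(user_id: int) -> str:
    return f"profile:{user_id}"


def versioned_key(user_id: int, fmt: str) -> str:
    # The format tag is part of the key, so each version sees only its own entries.
    return f"profile:{fmt}:{user_id}"


def read_new_version(cache: dict, user_id: int, load_from_db) -> dict:
    """New-version read path with versioned keys: old entries are a miss."""
    key = versioned_key(user_id, "v2")
    raw = cache.get(key)
    if raw is None:
        profile = load_from_db(user_id)
        cache[key] = encode_packed(profile)
        return profile
    return decode_packed(raw)


profile = {"user_id": 7, "name": "ada"}

# Shared key: an old pod writes JSON, a new pod decodes it as packed binary.
skewed_cache = {shared_key(7): encode_json(profile)}
skewed = decode_packed(skewed_cache[shared_key(7)])

# Versioned keys: the old entry is invisible to the new pod, which refills
# from the source of truth instead of misreading foreign bytes.
safe_cache = {versioned_key(7, "v1"): encode_json(profile)}
safe = read_new_version(safe_cache, 7, lambda uid: dict(profile))

print(skewed == profile)  # False: the read "succeeded" but the data is garbage
print(safe == profile)    # True
```

The trade-off of versioned keys is a cold cache for the new version during rollout, so the backing store should be able to absorb the extra reads. The old-format keys then age out through their normal expiry.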

