Question
During a 30-minute rolling deploy of the `profile` service, users intermittently see corrupted or stale profile data (sometimes another user's field, sometimes a missing avatar) at a rate roughly matching the fraction of pods on the new version, and the problem persists for a while *after* the deploy completes. Dashboards show a normal hit rate on the shared Redis cache, no errors, and no latency change. The new version changed the cached value's serialization from JSON to a packed binary format but kept the same cache keys: old pods read and write JSON under those keys, new pods read and write binary under the same keys, all against a *shared* Redis. How do you triage and mitigate, and why does it persist after the deploy?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
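One way to make the "prevents a repeat" step concrete is sketched below. It is an illustration under stated assumptions, not the `profile` service's actual code: the key scheme `profile:<fmt>:<id>`, the two-byte `MAGIC` header, the toy binary layout, and every function name are hypothetical, and a plain `dict` stands in for the shared Redis. The idea is that each serialization format gets its own key namespace, and each decoder rejects bytes it does not recognize and treats them as a cache miss. This also suggests why the symptom could outlive the deploy: entries written in the old format stay in the cache until they expire, are evicted, or are deleted, so a reader that trusts them keeps getting hits (and a normal hit rate) on values it cannot interpret correctly.

```python
import json
import struct
from typing import Callable, Optional

# Hypothetical binary layout: 2-byte magic, 4-byte big-endian user id, UTF-8 name.
MAGIC = b"\xAB\x02"


def key_for(user_id: int, fmt: str) -> str:
    """Put the format version in the key so two formats never share a slot."""
    return f"profile:{fmt}:{user_id}"


def encode_v1(profile: dict) -> bytes:
    return json.dumps(profile).encode("utf-8")


def decode_v1(raw: bytes) -> Optional[dict]:
    """Return None (treated as a miss) for anything that is not a JSON object."""
    try:
        value = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return None
    return value if isinstance(value, dict) else None


def encode_v2(profile: dict) -> bytes:
    return MAGIC + struct.pack(">I", profile["id"]) + profile["name"].encode("utf-8")


def decode_v2(raw: bytes) -> Optional[dict]:
    """Return None (treated as a miss) unless the magic header is present."""
    if len(raw) < 6 or not raw.startswith(MAGIC):
        return None
    (user_id,) = struct.unpack(">I", raw[2:6])
    try:
        name = raw[6:].decode("utf-8")
    except UnicodeDecodeError:
        return None
    return {"id": user_id, "name": name}


CODECS = {"v1": (decode_v1, encode_v1), "v2": (decode_v2, encode_v2)}


def read_profile(cache: dict, user_id: int, fmt: str,
                 load_from_db: Callable[[int], dict]) -> dict:
    """Read-through lookup that never trusts an undecodable or mismatched entry."""
    decode, encode = CODECS[fmt]
    key = key_for(user_id, fmt)
    raw = cache.get(key)
    profile = decode(raw) if raw is not None else None
    # Guard against a value that decodes cleanly but belongs to someone else.
    if profile is None or profile.get("id") != user_id:
        profile = load_from_db(user_id)
        cache[key] = encode(profile)
    return profile
```

With this shape, an old pod calling `read_profile(cache, 7, "v1", db)` and a new pod calling `read_profile(cache, 7, "v2", db)` populate separate keys, so neither reads the other's bytes. The cost is a temporarily lower hit rate and some duplicate entries while both versions are live, which is usually an acceptable trade for correctness during a rollout.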