Question
A bad deploy of the `pricing` service is rolled back at 16:05 after a latency regression. The rollback uses the same rolling strategy and is reported 'complete' at 16:09. But errors don't fully clear: they settle at ~4% with `proto: cannot parse invalid wire-format data` on the `quote` RPC, intermittent and seemingly random across requests. Dashboards: the deploy tool shows 100% of `pricing` pods on the old (rolled-back) version, but the *caller*, `checkout`, was deployed an hour earlier and is still on its new version. The new `pricing` had added a field to the protobuf response (field number 9, a new nested message), and `checkout`'s new version started reading field 9. After rollback, old `pricing` no longer sends field 9, but ~4% of in-flight gRPC connections are sticky/long-lived. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
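One way to make "form hypotheses from real signals" concrete for this incident is to split `quote` calls by whether their gRPC connection was opened before or after the rollback was reported complete at 16:09. The sketch below assumes request logs can be exported with a connection-open timestamp and an error flag. The field names `conn_opened_at` and `error`, the function name, and the calendar date are all hypothetical and not taken from any real tool.

```python
from collections import defaultdict
from datetime import datetime

# Only the 16:09 time comes from the scenario; the calendar date is a placeholder.
ROLLBACK_DONE = datetime(2024, 1, 1, 16, 9)


def error_rate_by_connection_era(records, cutoff=ROLLBACK_DONE):
    """Group request records by connection age relative to the rollback.

    Each record is a dict with two hypothetical fields:
      conn_opened_at: datetime when the underlying gRPC connection was opened
      error:          True if the `quote` call failed with the parse error
    Returns a dict mapping "pre_rollback" / "post_rollback" to an error rate.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for rec in records:
        era = "pre_rollback" if rec["conn_opened_at"] < cutoff else "post_rollback"
        totals[era] += 1
        if rec["error"]:
            errors[era] += 1
    # Only eras that actually appear in the data are reported.
    return {era: errors[era] / totals[era] for era in totals}
```

If errors cluster almost entirely on pre-rollback connections, that would support the sticky-connection hypothesis and point toward draining or recycling those connections as a mitigation. If the rates are similar in both groups, the cause probably lies elsewhere and the hypothesis should be dropped.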