Code Room
On-call · Hard
Question
During a routine rolling deploy of the orders service (20 pods, ~8 minutes to fully roll), you get intermittent 500s: roughly 1 in 4 requests fail with the deserialization error 'unknown enum value ORDER_PARTIALLY_REFUNDED'. The errors started as the rollout began, and you expect them to persist. v51 (new) added a new enum value to the OrderStatus protobuf and writes it; v50 (old) reads from the same shared cache/queue and can't parse it. Both versions run simultaneously during the roll. Triage and mitigate.
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
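For the "what prevents a repeat" part, one common option is a tolerant reader: the consumer maps enum values it does not recognize to a fallback instead of raising. The sketch below is only an illustration of that idea in plain Python, not the orders service's code and not the protobuf API. The names `OrderStatusV50`, `parse_strict`, `parse_tolerant`, and every enum member are hypothetical; the only value taken from the scenario is the string ORDER_PARTIALLY_REFUNDED.

```python
from enum import Enum


class OrderStatusV50(Enum):
    """Hypothetical view of the enum as the old reader (v50) knows it."""
    UNKNOWN = 0    # assumed fallback member, not confirmed by the scenario
    PLACED = 1     # assumed member, for illustration only
    REFUNDED = 2   # assumed member, for illustration only


def parse_strict(name: str) -> OrderStatusV50:
    # Mirrors the failure mode in the scenario: an unrecognized
    # value raises (KeyError here) and would surface as a 500.
    return OrderStatusV50[name]


def parse_tolerant(name: str) -> OrderStatusV50:
    # Tolerant reader: unrecognized values degrade to UNKNOWN so an
    # old reader keeps working while a newer writer is rolling out.
    try:
        return OrderStatusV50[name]
    except KeyError:
        return OrderStatusV50.UNKNOWN


if __name__ == "__main__":
    print(parse_tolerant("PLACED"))
    print(parse_tolerant("ORDER_PARTIALLY_REFUNDED"))
```

A tolerant reader only helps if it is already deployed everywhere before any writer emits the new value, so it is usually paired with a two-step rollout: ship the release that can read the value first, then a later release that writes it.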