Code Room
On-call, Hard, oc-g049
Subject: Version skew
Level: Senior–Staff
~40 min
Common in: Reliability & on-call interviews
Industries: Technology, Software development

Question

During a routine rolling deploy of the orders service (20 pods, ~8 minutes to fully roll), you get intermittent 500s — roughly 1 in 4 requests fail with a deserialization error 'unknown enum value ORDER_PARTIALLY_REFUNDED'. The errors started as the rollout began and you expect them to persist. v51 (new) added a new enum value to the OrderStatus protobuf and writes it; v50 (old) reads from the same shared cache/queue and can't parse it. Both versions run simultaneously during the roll. Triage and mitigate.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
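
For the "what prevents a repeat" part, one common option is a tolerant reader: old code maps enum values it does not recognise to a fallback instead of throwing. The sketch below is a minimal Python illustration of that idea, not the orders service's real code and not the protobuf API. Only OrderStatus and ORDER_PARTIALLY_REFUNDED come from the question; the other enum members and both function names are hypothetical, and it assumes the status arrives as a string name, as the error message suggests.

```python
from enum import Enum


class OrderStatus(Enum):
    """Hypothetical v50 view of the enum: ORDER_PARTIALLY_REFUNDED does not exist yet."""
    ORDER_STATUS_UNKNOWN = 0  # assumed fallback member
    ORDER_PLACED = 1          # assumed member, for illustration only
    ORDER_REFUNDED = 2        # assumed member, for illustration only


def parse_status_strict(name: str) -> OrderStatus:
    """Mimics the failing behaviour: an unrecognised name raises (KeyError here)."""
    return OrderStatus[name]


def parse_status_tolerant(name: str) -> OrderStatus:
    """Tolerant reader: an unrecognised name degrades to the fallback member."""
    try:
        return OrderStatus[name]
    except KeyError:
        return OrderStatus.ORDER_STATUS_UNKNOWN


if __name__ == "__main__":
    # A v51 writer put the new value on the shared cache/queue; a v50 reader sees it.
    print(parse_status_tolerant("ORDER_PARTIALLY_REFUNDED"))
```

A reader like this only helps if it is deployed everywhere before any version starts writing the new value, so the usual pairing is a two-step rollout: ship the tolerant reader first, then ship the writer in a later release.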

Diagram & narrate the incident