Code Room
On-call · Hard · oc-g054
Subject Bad rollout
Level Senior–Staff
~40 min
Common in Networking & APIs · Reliability & on-call interviews
Industries Technology, Software development

Question

A platform team rolls out a new service-mesh sidecar (v2) to the inventory service via a 50% canary at 13:00. The 50% of inventory pods with the v2 sidecar start failing all outbound calls to the pricing service with 'no healthy upstream', while the 50% on v1 are fine. Pricing service itself is healthy and serving the v1-sidecar pods normally. The mesh control plane shows pricing's endpoints as healthy. The v2 sidecar release notes mention a change to how it resolves upstream service names. Triage and mitigate.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
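
One way to turn the release-note hint into a testable hypothesis is to compare how each sidecar version might resolve the name `pricing` and check the result against the endpoints the control plane publishes. The sketch below is illustrative only: the namespace `shop`, the endpoint table, and both resolver behaviors (v1 expanding short names, v2 using the name literally) are assumptions made for the example, not documented behavior of any real mesh. It shows how a healthy upstream can still look empty to a client that looks it up under a different name.

```python
from typing import Dict, List, Tuple

# Hypothetical endpoint table, keyed by fully qualified service name,
# standing in for what the control plane reports as healthy.
ENDPOINTS: Dict[str, List[str]] = {
    "pricing.shop.svc.cluster.local": ["10.0.1.12:8080", "10.0.1.13:8080"],
}

SUFFIX = ".svc.cluster.local"


def resolve_v1(name: str, caller_namespace: str) -> str:
    """Assumed v1 behavior: expand short names using the caller's namespace."""
    if name.endswith(SUFFIX):
        return name
    if "." in name:  # already "service.namespace"
        return name + SUFFIX
    return f"{name}.{caller_namespace}{SUFFIX}"


def resolve_v2(name: str, caller_namespace: str) -> str:
    """Assumed v2 behavior: use the configured name literally, no expansion."""
    return name


def diagnose(name: str, caller_namespace: str) -> Dict[str, Tuple[str, str]]:
    """Return, per sidecar version, the resolved name and the lookup outcome."""
    report = {}
    for version, resolver in (("v1", resolve_v1), ("v2", resolve_v2)):
        resolved = resolver(name, caller_namespace)
        status = "ok" if ENDPOINTS.get(resolved) else "no healthy upstream"
        report[version] = (resolved, status)
    return report


if __name__ == "__main__":
    for version, (resolved, status) in diagnose("pricing", "shop").items():
        print(f"{version}: {resolved!r} -> {status}")
```

If the real sidecars behave this way, the signal to look for is a mismatch between the upstream name in a v2 pod's config dump and the name the control plane lists for pricing. Whether or not that hypothesis holds, the mitigation stays the same: roll the canary back to v1 first, then confirm the cause.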
