On-call · Hard · oc-g471

Subject: Blue-green · Level: Senior–Staff · ~40 min · Common in: Reliability & on-call interviews · Industries: Technology, Telecom

Question

A blue-green cutover moves a gRPC streaming service (long-lived bidirectional streams, e.g. live telemetry) from blue to green. At 11:00 the load balancer (LB) is flipped to green; five minutes later blue is scaled to zero. Green is healthy, and new streams to green are fine. But at 11:05, when blue scales to zero, thousands of clients' in-flight streams drop simultaneously. The clients reconnect to green in a thundering herd: green's CPU spikes to 100% for ~3 minutes and p99 stream-setup latency blows out before settling. Dashboards show that green's stream-setup error rate spikes at 11:05, green's connection count jumps vertically, and blue's connection count was still ~8k when it was scaled to zero. Triage, explain the mechanism, and prevent it.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
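One way to make the "prevents a repeat" step concrete is to show the arithmetic of draining blue gradually instead of all at once. The sketch below is illustrative only: the function names (`drain_schedule`, `safe_to_scale_down`, `reconnect_delay`), the 10-second batch interval, and the jitter bound are assumptions, not part of the question. The ~8k stream count and the five-minute (300 s) window come from the scenario. In a real deployment the server would typically signal the drain per connection (for example via HTTP/2 GOAWAY or a maximum connection age) rather than through code like this.

```python
import random


def drain_schedule(open_streams: int, window_s: int, batch_interval_s: int) -> list[int]:
    """Split open streams into near-equal batches to close across the drain window."""
    batches = max(1, window_s // batch_interval_s)
    base, extra = divmod(open_streams, batches)
    # The first `extra` batches take one additional stream so the total is exact.
    return [base + (1 if i < extra else 0) for i in range(batches)]


def safe_to_scale_down(open_streams: int, threshold: int = 0) -> bool:
    """Gate the scale-to-zero step on blue's live connection count, not on a timer."""
    return open_streams <= threshold


def reconnect_delay(max_jitter_s: float, rng: random.Random) -> float:
    """Client side: wait a uniformly random time before redialing after a drop."""
    return rng.uniform(0.0, max_jitter_s)


if __name__ == "__main__":
    # Hypothetical plan: close blue's ~8k streams in batches every 10 s over 300 s.
    plan = drain_schedule(open_streams=8000, window_s=300, batch_interval_s=10)
    print(len(plan), max(plan), sum(plan))  # 30 267 8000
    # With ~8k streams still open, the scale-down gate should refuse.
    print(safe_to_scale_down(8000))  # False
```

Under these assumptions green sees at most 267 reconnects per 10-second batch instead of roughly 8,000 at once, and the scale-down gate would have blocked the 11:05 step while blue still held ~8k connections. Client-side jitter is a complementary safeguard in case a drop happens anyway.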

