Question
A blue-green cutover moves a gRPC streaming service (long-lived bidirectional streams, e.g. live telemetry) from blue to green at 11:00 by flipping the LB to green and then scaling blue to zero five minutes later. Green is healthy. New streams to green are fine. But at 11:05 when blue scales to zero, thousands of clients' in-flight streams drop simultaneously; clients reconnect in a thundering herd to green, whose CPU spikes to 100% for ~3 minutes and p99 stream-setup latency blows out before settling. Dashboards: green stream-setup error rate spikes at 11:05, connection count on green jumps vertically, blue's connection count was still ~8k when it was scaled to zero. Triage, explain the mechanism, and prevent it.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.