Question
A blue-green cutover flips 100% of traffic from blue to green at once at 13:00. Green's pods are healthy and passed smoke tests. Immediately after the flip, p99 latency triples and DB read load spikes ~5x for ~6 minutes, then everything settles back to normal on its own. No errors, no rollback performed. Dashboards: app-pod CPU on green is high during the spike then normal; DB CPU and read IOPS spike then normal; the in-process LRU cache hit-rate on green starts at ~0% and climbs to ~95% over the same 6 minutes. Blue still shows 95% cache hit-rate the whole time. What happened, how would you have triaged it live, and how do you prevent it?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.