On-callMediumoc-g273

Subject Blue greenLevel Mid–Senior~30 minCommon in Reliability & on-call interviewsIndustries Technology, Software development

Question

A blue-green cutover flips 100% of traffic from blue to green at once at 13:00. Green's pods are healthy and passed smoke tests. Immediately after the flip, p99 latency triples and DB read load spikes ~5x for ~6 minutes, then everything settles back to normal on its own. No errors, no rollback performed. Dashboards: app-pod CPU on green is high during the spike then normal; DB CPU and read IOPS spike then normal; the in-process LRU cache hit-rate on green starts at ~0% and climbs to ~95% over the same 6 minutes. Blue still shows 95% cache hit-rate the whole time. What happened, how would you have triaged it live, and how do you prevent it?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.