On-call | Hard | oc-g283

Subject: Blue-green
Level: Senior–Staff
Time: ~40 min
Common in: Reliability & on-call interviews
Industries: Technology, Software development

Question

A blue-green release moves the `ledger` service to green, which writes to a NEW table layout (`ledger_v2`); blue writes the old layout (`ledger_v1`). To allow rollback, the team cuts over to green at 12:00 but keeps blue's pods on standby. At 12:25 a bug is found in green; they roll back to blue at 12:28. Now reconciliation alerts fire: ~25 minutes of transactions written by green to `ledger_v2` are invisible to blue (which only reads/writes `ledger_v1`), so balances are wrong and some transactions appear lost. Dashboards: green wrote ~12k rows to `ledger_v2` during the window; `ledger_v1` has a 25-minute gap; no app errors. How do you triage, mitigate, and what was the design flaw?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
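One way to make the mitigation step concrete is an idempotent backfill that copies green's rows from `ledger_v2` into `ledger_v1` for the cutover window. The sketch below is only an illustration under stated assumptions: the column names, the shared `txn_id` key, the example date, and the helper names (`missing_txn_ids`, `backfill`, `make_demo_db`) are all hypothetical, since the question does not give the real schemas, and SQLite stands in for whatever database the `ledger` service actually uses.

```python
import sqlite3

# Hypothetical schemas: the real ledger_v1 / ledger_v2 columns are not given.
SCHEMA = """
CREATE TABLE ledger_v1 (
    txn_id TEXT PRIMARY KEY,
    account TEXT NOT NULL,
    amount_cents INTEGER NOT NULL,
    created_at TEXT NOT NULL
);
CREATE TABLE ledger_v2 (
    txn_id TEXT PRIMARY KEY,
    account_id TEXT NOT NULL,
    amount_cents INTEGER NOT NULL,
    created_at TEXT NOT NULL
);
"""

def missing_txn_ids(conn, start, end):
    """Return txn_ids green wrote to ledger_v2 in [start, end) that blue cannot see."""
    rows = conn.execute(
        """
        SELECT v2.txn_id
        FROM ledger_v2 AS v2
        LEFT JOIN ledger_v1 AS v1 ON v1.txn_id = v2.txn_id
        WHERE v2.created_at >= ? AND v2.created_at < ?
          AND v1.txn_id IS NULL
        ORDER BY v2.txn_id
        """,
        (start, end),
    ).fetchall()
    return [r[0] for r in rows]

def backfill(conn, start, end):
    """Copy window rows from ledger_v2 into ledger_v1; safe to re-run.

    INSERT OR IGNORE skips txn_ids already present in ledger_v1, so a retry
    after a partial failure cannot double-post a transaction.
    Returns the number of rows actually added.
    """
    before = conn.execute("SELECT COUNT(*) FROM ledger_v1").fetchone()[0]
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            """
            INSERT OR IGNORE INTO ledger_v1 (txn_id, account, amount_cents, created_at)
            SELECT txn_id, account_id, amount_cents, created_at
            FROM ledger_v2
            WHERE created_at >= ? AND created_at < ?
            """,
            (start, end),
        )
    after = conn.execute("SELECT COUNT(*) FROM ledger_v1").fetchone()[0]
    return after - before

def make_demo_db():
    """Tiny in-memory stand-in: one pre-cutover row in v1, two green rows in v2."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO ledger_v1 VALUES ('t1', 'acct-a', 500, '2024-01-01T11:59:00')"
    )
    conn.executemany(
        "INSERT INTO ledger_v2 VALUES (?, ?, ?, ?)",
        [
            ("t2", "acct-a", -200, "2024-01-01T12:05:00"),
            ("t3", "acct-b", 900, "2024-01-01T12:20:00"),
        ],
    )
    conn.commit()
    return conn

if __name__ == "__main__":
    conn = make_demo_db()
    window = ("2024-01-01T12:00:00", "2024-01-01T12:28:00")
    print("invisible to blue:", missing_txn_ids(conn, *window))
    print("rows backfilled:", backfill(conn, *window))
    print("still missing:", missing_txn_ids(conn, *window))
```

Running the gap query before and after the backfill gives a number to report in status updates. A real backfill would also have to account for anything blue wrote to `ledger_v1` after the 12:28 rollback and for recomputing affected balances, which this sketch does not attempt.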

