Question
A blue-green release moves the `ledger` service to green, which writes to a NEW table layout (`ledger_v2`); blue writes the old layout (`ledger_v1`). To allow rollback, the team cuts over to green at 12:00 but keeps blue's pods on standby. At 12:25 a bug is found in green; they roll back to blue at 12:28. Now reconciliation alerts fire: ~25 minutes of transactions written by green to `ledger_v2` are invisible to blue (which only reads/writes `ledger_v1`), so balances are wrong and some transactions appear lost. Dashboards: green wrote ~12k rows to `ledger_v2` during the window; `ledger_v1` has a 25-minute gap; no app errors. How do you triage, mitigate, and what was the design flaw?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.