Question
You run an active-active multi-region datastore (region-east, region-west) with asynchronous cross-region replication. At 04:00 the inter-region network link degrades — not a clean cut, but packet loss pushes replication lag from <1s to several minutes and flapping. Dashboards: each region still serves local reads/writes fine, but users who write in one region and read from the other (e.g. mobile clients hitting different edges) see their own writes 'disappear,' and a few cross-region operations conflict. Recent context: a backbone provider reported degradation. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.