Code Room
On-call · Medium · oc-g205
Subject: Partial outage · Level: Mid–Senior · ~35 min
Common in: Networking & APIs · Reliability & on-call interviews
Industries: Technology, Software development

Question

You run active-active in three regions (us-east, us-west, eu-west). At 16:20, ~30% of users report total failure while others are fine. Dashboards: us-east error rate is 100%, the other two regions are at baseline; us-east's regional API gateway is returning connection-refused; a cloud-provider status page just posted a networking incident in us-east-1; your global DNS is still routing ~1/3 of traffic to us-east via latency-based routing. No deploy. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
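
As a rough illustration of the "mitigate first" step in this scenario, the sketch below decides which regions to pull out of global routing and how traffic redistributes across the survivors. Everything in it is hypothetical: the `RegionHealth` type, the `plan_drain` helper, the 50% error threshold, and the 1% baseline error rate are assumptions for illustration, not part of any real DNS or cloud-provider API.

```python
from dataclasses import dataclass


@dataclass
class RegionHealth:
    name: str
    error_rate: float     # fraction of failed requests, 0.0 to 1.0
    traffic_share: float  # fraction of global traffic currently routed here


def plan_drain(regions, error_threshold=0.5):
    """Return (regions_to_drain, new_traffic_shares) for a routing-level failover.

    error_threshold is an assumed cutoff; a real policy would also
    consider how long the region has been failing.
    """
    unhealthy = [r.name for r in regions if r.error_rate >= error_threshold]
    healthy = [r for r in regions if r.error_rate < error_threshold]

    # Safety rail: never drain every region at once.
    if not healthy:
        return [], {r.name: r.traffic_share for r in regions}

    healthy_total = sum(r.traffic_share for r in healthy)
    if healthy_total == 0:
        # No traffic on the healthy regions yet: split evenly.
        return unhealthy, {r.name: 1 / len(healthy) for r in healthy}

    # Redistribute the drained traffic proportionally to current shares.
    new_shares = {r.name: r.traffic_share / healthy_total for r in healthy}
    return unhealthy, new_shares


# Snapshot matching the question: us-east at 100% errors, others at an
# assumed 1% baseline, traffic split roughly in thirds.
snapshot = [
    RegionHealth("us-east", 1.00, 1 / 3),
    RegionHealth("us-west", 0.01, 1 / 3),
    RegionHealth("eu-west", 0.01, 1 / 3),
]
drain, shares = plan_drain(snapshot)
```

With this snapshot, only us-east is drained and each surviving region goes from about a third of global traffic to about half, roughly a 1.5x load increase. A strong answer would confirm us-west and eu-west have that headroom (or scale them up) as part of the mitigation, before moving on to hypotheses and root cause.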
