On-callHardoc-g557

Subject On callLevel Senior–Staff~45 minCommon in Reliability & on-call interviewsIndustries Technology

Question

At 18:20 your cloud provider posts a degradation in availability zone us-east-1a; you run active-active across three AZs (1a, 1b, 1c) in that region. Error rate jumps to ~30% and p99 doubles. Health checks show 1a's instances flapping. Your load balancer is supposed to drain unhealthy targets but error rate isn't dropping the way you'd expect. Investigating: connections to your primary Postgres (whose writer happens to be in 1a) are timing out for ~40 seconds at a time, and a chunk of cache nodes in 1a are unreachable, causing a partial cache miss storm onto the DB. Capacity in 1b/1c is at ~90% because they're absorbing 1a's share but were only provisioned for n+1, not a full AZ loss. Walk me through stabilizing a partial-AZ outage and what the durable posture should be.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.