Question
At 18:20 your cloud provider posts a degradation in availability zone us-east-1a; you run active-active across three AZs (1a, 1b, 1c) in that region. Error rate jumps to ~30% and p99 doubles. Health checks show 1a's instances flapping. Your load balancer is supposed to drain unhealthy targets but error rate isn't dropping the way you'd expect. Investigating: connections to your primary Postgres (whose writer happens to be in 1a) are timing out for ~40 seconds at a time, and a chunk of cache nodes in 1a are unreachable, causing a partial cache miss storm onto the DB. Capacity in 1b/1c is at ~90% because they're absorbing 1a's share but were only provisioned for n+1, not a full AZ loss. Walk me through stabilizing a partial-AZ outage and what the durable posture should be.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.