Question
The auth service has a brief 4-second hiccup at 09:30 (a leader election in its datastore). Instead of recovering, it gets dramatically worse: request volume to it suddenly triples and stays there, latency stays high, and it keeps falling over every time it tries to come back. Dashboards: inbound RPS to auth is 3x baseline though product traffic is flat; the gateway shows a surge of retried requests; each auth recovery attempt is immediately re-buried. The clients (mobile + 6 internal services) all retry failed auth calls 3x with a fixed 200ms delay. Explain the dynamics and how you break the cycle.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
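As a starting point for the "explain the dynamics" part, the retry arithmetic can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the clients' actual retry code: it reads "retry 3x" as three total attempts per logical request (if it means three retries after the first attempt, pass 4 instead), and it treats failures as independent, which understates how correlated they are during a real outage. The names `expected_attempts`, `fixed_delay`, and `full_jitter_delay` are hypothetical. The jittered backoff is one common alternative to a fixed delay, shown only for comparison.

```python
import random


def expected_attempts(p_fail: float, max_attempts: int) -> float:
    """Expected attempts per logical request.

    Assumes each attempt fails independently with probability p_fail and
    the client gives up after max_attempts total attempts (an assumption
    about what "retry 3x" means in the question).
    """
    # Attempt k+1 is only made if the first k attempts all failed.
    return sum(p_fail ** k for k in range(max_attempts))


def fixed_delay(attempt: int, delay_s: float = 0.2) -> float:
    """The policy described in the question: always wait 200 ms."""
    return delay_s


def full_jitter_delay(attempt: int, base_s: float = 0.2,
                      cap_s: float = 10.0, rng=random) -> float:
    """Exponential backoff with full jitter (illustrative alternative).

    Waits a uniformly random time between 0 and min(cap_s, base_s * 2**attempt),
    so retries from many clients spread out instead of arriving together.
    The cap of 10 s is an arbitrary value chosen for this sketch.
    """
    return rng.uniform(0.0, min(cap_s, base_s * 2 ** attempt))


if __name__ == "__main__":
    # Load amplification on auth as the failure rate rises.
    for p in (0.0, 0.5, 1.0):
        print(f"p_fail={p}: {expected_attempts(p, 3)} attempts per request")
```

When every attempt fails, each logical request turns into three attempts, which is consistent with the 3x inbound RPS on the dashboard while product traffic stays flat. Because the delay is fixed and short, those extra attempts land within well under a second of the original failure, so a recovering auth instance sees the amplified load immediately.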