Question
At 20:30 the payments service has a 4-second blip — a downstream fraud-scoring dependency had a brief GC pause and returned slow. By 20:32 the whole platform is on fire: the API gateway is at 100% error rate, every service's thread pools are exhausted, and the fraud-scoring service — which recovered from its GC pause at 20:30:04 — is now being hammered at 50x its normal request rate and can't stay up. Request counts to fraud-scoring are far higher than the actual user traffic, which is flat. Dashboards show retry counts across services spiking into the millions. A small, transient dependency hiccup has turned into a total outage that won't self-heal even though the original cause is gone. Triage this and explain why it's not recovering.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.