On-callHardoc-g551

Subject On callLevel Senior–Staff~45 minCommon in Reliability & on-call interviewsIndustries Technology

Question

At 20:30 the payments service has a 4-second blip — a downstream fraud-scoring dependency had a brief GC pause and returned slow. By 20:32 the whole platform is on fire: the API gateway is at 100% error rate, every service's thread pools are exhausted, and the fraud-scoring service — which recovered from its GC pause at 20:30:04 — is now being hammered at 50x its normal request rate and can't stay up. Request counts to fraud-scoring are far higher than the actual user traffic, which is flat. Dashboards show retry counts across services spiking into the millions. A small, transient dependency hiccup has turned into a total outage that won't self-heal even though the original cause is gone. Triage this and explain why it's not recovering.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.