On-callHardoc-g248

Subject Retry amplificationLevel Senior–Staff~40 minCommon in Reliability & on-call interviewsIndustries Technology, Software development

Question

A brief 5-second blip in a downstream auth service (now fully recovered) has left your whole API tier wedged: error rate is stuck at 60%, the auth service is pegged at 100% CPU even though the underlying fault is gone, and traffic into your tier is normal. Restarting individual API pods doesn't help — they re-wedge within seconds. Each client SDK retries failed auth calls 3x with fixed 200ms backoff, and your API layer also retries auth 2x on its own. The system won't recover on its own even though nothing is currently broken. How do you triage and break the loop?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.