On-call · Hard · oc-g040

Subject: Retry amplification
Level: Senior–Staff
Time: ~40 min
Common in: Reliability & on-call interviews
Industries: Technology, Software development

Question

A core internal service briefly hiccups at 20:00 (a 5-second blip during a leader election). Instead of recovering, it stays down: it comes back up, gets instantly flooded to many times its normal request volume, falls over again, and this cycle repeats every ~30 seconds. Five upstream services all call it; each retries failed calls 3x immediately, and several have their own callers that also retry. Request volume to the struggling service is now ~8x baseline even though end-user traffic is normal. Triage and break the cycle.
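
As a rough sanity check on the numbers in the question, the sketch below computes the worst-case multiplication from stacked retries. It assumes "3x" means three retries on top of the original attempt and that every call fails during the outage. The function name `worst_case_attempts` is hypothetical and used only for illustration.

```python
def worst_case_attempts(retries_per_layer):
    """Attempts that reach the bottom service per end-user request
    when every call fails.

    Each layer makes 1 original call plus its retries, and the layers
    multiply because every attempt at one layer triggers a full set of
    attempts at the layer below it.
    """
    total = 1
    for retries in retries_per_layer:
        total *= 1 + retries
    return total


# One retrying layer (a direct caller with 3 retries): 4 attempts.
print(worst_case_attempts([3]))
# Two stacked retrying layers (a caller of a caller): 16 attempts.
print(worst_case_attempts([3, 3]))
```

Under these assumptions a direct caller contributes up to 4x and a two-layer chain up to 16x, so an observed ~8x is plausible when only some of the five upstreams have retrying callers of their own.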

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
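
One possible shape for the "what prevents a repeat" part is sketched below: capped exponential backoff with full jitter, plus a client-side retry budget so retries cannot exceed a fraction of normal traffic. This is an illustrative sketch, not a prescribed fix. `RetryBudget`, `backoff_delay`, `call_with_retries`, and all default values are assumptions chosen for the example.

```python
import random
import time


class RetryBudget:
    """Hypothetical helper: permits retries only up to a fraction of
    the first attempts seen, so retries cannot multiply load unboundedly."""

    def __init__(self, ratio=0.1, max_tokens=10.0):
        self.ratio = ratio            # retry tokens earned per first attempt
        self.max_tokens = max_tokens  # upper bound on saved-up retries
        self.tokens = max_tokens

    def record_request(self):
        # Each first attempt earns a fraction of a retry token.
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def try_spend(self):
        # A retry costs one whole token; refuse when the budget is empty.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random):
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(fn, budget, max_retries=3, sleep=time.sleep):
    """Call fn, retrying on failure only while the budget allows it."""
    budget.record_request()
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            # Give up when attempts are exhausted or the budget is spent.
            if attempt == max_retries or not budget.try_spend():
                raise
            sleep(backoff_delay(attempt))
```

The jitter spreads retries out in time so callers do not all return at the instant the service recovers, and the budget bounds total retry volume even when every call is failing. In a layered call graph it usually also helps to retry at only one layer, so the multiplication described in the question cannot stack.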

Learn the concepts

Diagram & narrate the incident


Run or narrate your approach, then ask the coach.