Code Room
On-call · Hard
Question
A core internal service briefly hiccups at 20:00 (a 5-second blip during a leader election). Instead of recovering, it stays down: it comes back up, is immediately flooded with many times its normal request volume, falls over again, and the cycle repeats every ~30 seconds. Five upstream services all call it; each retries a failed call 3 times with no delay, and several have their own callers that also retry. Request volume to the struggling service is now ~8x baseline even though end-user traffic is normal. Triage and break the cycle.
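A rough way to see how normal end-user traffic can become a multiple of baseline load is to multiply the attempts made at each retrying layer. The sketch below is a worst-case model only: it assumes every attempt fails and that each layer makes one initial call plus its full set of retries. The function name and the layer lists are illustrative, not taken from the scenario.

```python
from math import prod


def worst_case_amplification(retries_per_layer):
    """Attempts reaching the bottom service per original request.

    Assumes every attempt fails, so each layer makes 1 initial call
    plus all of its retries. Layers are listed from the top caller down.
    """
    return prod(1 + retries for retries in retries_per_layer)


# One retrying layer (an upstream calling the service directly).
direct = worst_case_amplification([3])

# Two retrying layers (a caller that retries an upstream that also retries).
nested = worst_case_amplification([3, 3])

print(direct, nested)  # 4 16
```

Under these assumptions the observed ~8x sits between the one-layer and two-layer worst cases, which would be consistent with only some call paths having a second retrying layer, or with some attempts succeeding. That reading is an inference from the model, not a measurement.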
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
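One possible shape for the "prevents a repeat" part, on the caller side, is to replace immediate retries with capped exponential backoff plus jitter, and to bound total retries with a budget. The sketch below is illustrative only: `RetryBudget`, `backoff_delay`, `call_with_retries`, and all default values are hypothetical names and numbers, not anything the scenario specifies.

```python
import random
import time


class RetryBudget:
    """Token bucket that limits retries to a fraction of requests."""

    def __init__(self, ratio=0.1, max_tokens=10.0):
        self.ratio = ratio            # tokens earned per request (assumed value)
        self.max_tokens = max_tokens  # cap on banked retries (assumed value)
        self.tokens = max_tokens

    def record_request(self):
        # Every original request earns a fraction of a retry token.
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def try_spend(self):
        # A retry is allowed only if a whole token is available.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def backoff_delay(attempt, base=0.1, cap=5.0):
    """Full jitter: uniform delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


def call_with_retries(call, budget, max_retries=3, sleep=time.sleep):
    """Run call(); on failure, retry with jittered backoff while budget allows."""
    budget.record_request()
    attempt = 0
    while True:
        try:
            return call()
        except Exception:
            # Give up when retries are exhausted or the budget is empty.
            if attempt >= max_retries or not budget.try_spend():
                raise
            sleep(backoff_delay(attempt))
            attempt += 1
```

The jitter spreads retries out so that callers do not hit the recovering service in lockstep, and the budget caps how much extra load retries can add when most calls are failing. During the incident itself, the faster lever is usually to turn retries down or off at the callers rather than to ship new client code.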