On-callHardoc-g526

Subject Thundering herdLevel Senior–Staff~30 minCommon in Security interviewsIndustries Technology

Question

Your fleet authenticates to a downstream by caching a service-to-service OAuth access token (1-hour TTL) shared per instance. After a synchronized full-fleet restart this morning (a coordinated config rollout), everything ran fine — until almost exactly one hour later, when your auth/token endpoint got a sharp 60-second spike that maxed it out, your downstream calls briefly failed with auth errors, and then it cleared. No traffic spike, no deploy at that moment. How do you triage and prevent recurrence?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.