Code Room
On-callHard
Question
Your fleet authenticates to a downstream by caching a service-to-service OAuth access token (1-hour TTL) shared per instance. After a synchronized full-fleet restart this morning (a coordinated config rollout), everything ran fine — until almost exactly one hour later, when your auth/token endpoint got a sharp 60-second spike that maxed it out, your downstream calls briefly failed with auth errors, and then it cleared. No traffic spike, no deploy at that moment. How do you triage and prevent recurrence?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.