On-call · Hard · oc-g469

Subject: Deploy incidents
Level: Senior–Staff
~40 min
Common in: Security · Reliability & on-call interviews
Industries: Technology, Software development

Question

A signing key used to mint internal service-auth JWTs is rotated in the secret store at 02:00. The old key is meant to remain valid for verification for 24h (graceful overlap), and services should sign with the new key. Most pods had been restarted recently, so there is no incident at 02:00. At 13:00, a slowly growing fraction of inter-service calls starts getting rejected with `invalid signature`, and the failing calls match the set of pods that have NOT been restarted since before 02:00. Context: each service reads the SIGNING key once at boot and caches it. Verifiers were updated to accept both keys, but old-boot pods are still SIGNING with the rotated-out key, and one downstream applies its own shorter VERIFY-side overlap, which has just expired. Triage and mitigate.
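The failure mode in the scenario can be reproduced in a few lines. The sketch below is a simplified model, not a real JWT implementation: it uses HMAC as a stand-in for token signing, and the names `Pod`, `Verifier`, `current_signing_kid`, the calendar date, the key material, and the 11h overlap on the strict downstream are all assumptions chosen so that the old key ages out at 13:00.

```python
import hashlib
import hmac
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical date; only the 02:00 rotation time comes from the scenario.
ROTATED_AT = datetime(2024, 1, 1, 2, 0)
KEYS = {"old": b"old-secret", "new": b"new-secret"}  # stand-in key material


def current_signing_kid(now):
    """Key id the secret store hands out for signing at time `now`."""
    return "new" if now >= ROTATED_AT else "old"


@dataclass
class Pod:
    booted_at: datetime

    def __post_init__(self):
        # Read once at boot and never refreshed: the behaviour behind the incident.
        self.kid = current_signing_kid(self.booted_at)

    def sign(self, payload):
        mac = hmac.new(KEYS[self.kid], payload, hashlib.sha256).hexdigest()
        return self.kid, mac


@dataclass
class Verifier:
    overlap: timedelta  # how long the old key stays acceptable after rotation

    def verify(self, kid, mac, payload, now):
        if kid == "old" and now >= ROTATED_AT + self.overlap:
            return False  # old key has aged out of this verifier's trust set
        expected = hmac.new(KEYS[kid], payload, hashlib.sha256).hexdigest()
        return hmac.compare_digest(mac, expected)


stale_pod = Pod(booted_at=ROTATED_AT - timedelta(hours=6))
fresh_pod = Pod(booted_at=ROTATED_AT + timedelta(minutes=30))
strict = Verifier(overlap=timedelta(hours=11))    # assumed shorter overlap
graceful = Verifier(overlap=timedelta(hours=24))  # the intended 24h overlap

payload = b"svc-a->svc-b"
at_1259 = ROTATED_AT + timedelta(hours=10, minutes=59)
at_1300 = ROTATED_AT + timedelta(hours=11)

results = {
    "stale->strict @12:59": strict.verify(*stale_pod.sign(payload), payload, at_1259),
    "stale->strict @13:00": strict.verify(*stale_pod.sign(payload), payload, at_1300),
    "stale->graceful @13:00": graceful.verify(*stale_pod.sign(payload), payload, at_1300),
    "fresh->strict @13:00": strict.verify(*fresh_pod.sign(payload), payload, at_1300),
}
for case, ok in results.items():
    print(f"{case}: {'accepted' if ok else 'invalid signature'}")
```

Only the combination of an old-boot signer and the short-overlap verifier fails, which is why the error rate tracks the set of un-restarted pods rather than any one service.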

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
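One way to make "mitigate, then prevent a repeat" concrete is sketched below, under stated assumptions. `pods_to_restart` picks the pods that booted before the rotation (candidates for a rolling restart), and `RefreshingKey` replaces the read-once-at-boot cache with one that re-reads the key after a maximum age. Both names, the pod names, and the 5-minute refresh interval are hypothetical; how pod start times are obtained and how the secret store is queried are left to the caller.

```python
import time
from datetime import datetime


def pods_to_restart(pod_start_times, rotated_at):
    """Names of pods that booted before the rotation, oldest first."""
    stale = [
        (started, name)
        for name, started in pod_start_times.items()
        if started < rotated_at
    ]
    return [name for _, name in sorted(stale)]


class RefreshingKey:
    """Caches a signing key but re-reads it once the copy is older than max_age_s."""

    def __init__(self, fetch, max_age_s, clock=time.monotonic):
        self._fetch = fetch          # callable returning the current signing key
        self._max_age_s = max_age_s
        self._clock = clock          # injectable so the behaviour can be tested
        self._value = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._max_age_s:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value


# Mitigation: which pods still hold the pre-rotation key? (hypothetical data)
rotated_at = datetime(2024, 1, 1, 2, 0)
starts = {
    "billing-7f": datetime(2023, 12, 31, 20, 0),
    "search-2a": datetime(2024, 1, 1, 3, 15),
    "auth-9c": datetime(2023, 12, 30, 9, 0),
}
restart_order = pods_to_restart(starts, rotated_at)

# Prevention: a key cache that picks up a rotation without a restart.
store = {"signing": "old"}
fake_now = [0.0]
key = RefreshingKey(lambda: store["signing"], max_age_s=300, clock=lambda: fake_now[0])
before = key.get()          # first read, cached
store["signing"] = "new"    # rotation happens in the secret store
cached = key.get()          # still within max_age_s, so the old key is returned
fake_now[0] = 300.0
after = key.get()           # cache expired, so the new key is picked up
```

The restart list addresses the symptom quickly; the refreshing cache bounds how long any pod can keep signing with a rotated-out key, so the refresh interval should be well below the shortest verify-side overlap in the fleet.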
