On-call · Hard · oc-g488

Subject: mTLS failure
Level: Senior–Staff
Time: ~35 min
Common in: Reliability & on-call interviews
Industries: Technology, Software development

Question

Your internal mesh uses short-lived (1h) mTLS certs auto-rotated by a sidecar agent. At 03:00, services across one whole AZ start rejecting inbound mTLS with 'certificate is not yet valid' (notBefore in the future). The error rate in that AZ climbs to ~20% and keeps oscillating with the rotation cadence. Other AZs are clean. Dashboards show that every freshly issued cert is fine the moment it lands, but callers in the affected AZ reject it; the issuing CA is healthy. NTP dashboards show the time source those nodes peer with started serving time ~120s *behind* real time around 02:55. There was no deploy. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
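
To separate root cause from symptom, it can help to work through the validity-window arithmetic. The sketch below is illustrative only: `issue`, `accepts`, and the `backdate` parameter are hypothetical names, not part of any real mesh or CA API, and the 5-minute backdate is an assumed value. It shows why a verifier whose clock lags by 120s rejects a cert the CA has just issued, and how a backdated notBefore would tolerate that lag.

```python
from datetime import datetime, timedelta, timezone

CLOCK_LAG = timedelta(seconds=120)   # from the question: AZ clocks ~120s behind
CERT_LIFETIME = timedelta(hours=1)   # from the question: 1h certs


def issue(ca_now, backdate=timedelta(0)):
    """Hypothetical issuer: return (notBefore, notAfter) as the CA stamps them."""
    return ca_now - backdate, ca_now + CERT_LIFETIME


def accepts(not_before, not_after, verifier_now):
    """Plain validity-window check as seen from the verifier's own clock."""
    return not_before <= verifier_now <= not_after


real_now = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)  # arbitrary example instant
lagging_now = real_now - CLOCK_LAG                           # what the affected AZ believes

# Cert stamped with notBefore equal to the CA's (correct) current time.
not_before, not_after = issue(real_now)
healthy_ok = accepts(not_before, not_after, real_now)      # verifier with a correct clock
lagging_ok = accepts(not_before, not_after, lagging_now)   # verifier in the affected AZ
reject_seconds = (not_before - lagging_now).total_seconds()

# Same cert, but with notBefore backdated by an assumed 5 minutes.
bd_not_before, bd_not_after = issue(real_now, backdate=timedelta(minutes=5))
backdated_ok = accepts(bd_not_before, bd_not_after, lagging_now)

print(healthy_ok, lagging_ok, reject_seconds, backdated_ok)
```

Under these assumptions, each fresh cert is rejected by lagging verifiers for roughly the size of the clock lag after issuance and accepted afterwards, which is consistent with errors that oscillate with the rotation cadence. Backdating notBefore is a guard against a repeat, not a fix: the lagging time source still has to be corrected or replaced.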

