On-call · Hard · oc-g291

Subject: mTLS failure
Level: Senior–Staff
Time: ~40 min
Common in: Reliability & on-call interviews
Industries: Technology

Question

You run mutual TLS between internal services with short-lived (1-hour) certs issued by an internal CA. At 04:50 a single node — node-37 — starts rejecting every inbound mTLS connection with 'certificate is not yet valid' (notBefore in the future), while every other node is fine. Affected callers retry and shed to other nodes, so customer impact is small but error budget is bleeding. Dashboards: node-37's clock shows ~90 seconds ahead of the fleet; NTP status on it reads 'unsynchronized.' The issuing CA and all other nodes are healthy. There was no deploy. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
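One signal worth checking early is how the verifying node's clock sits relative to the certificate's validity window, since "not yet valid" means the verifier's clock reads earlier than notBefore. The following is a minimal sketch of that comparison, not the logic of any particular TLS library. The function name `validity_status`, the calendar date, the 5-second certificate age, and the absence of notBefore backdating are all assumptions made for illustration; only the 04:50 time, the 1-hour lifetime, and the 90-second offset come from the scenario.

```python
from datetime import datetime, timedelta, timezone


def validity_status(verifier_now, not_before, not_after):
    """Classify a certificate against the verifier's own clock (hypothetical helper)."""
    if verifier_now < not_before:
        return "not yet valid"
    if verifier_now > not_after:
        return "expired"
    return "valid"


# Hypothetical date; 04:50 UTC stands in for the fleet's reference time.
fleet_now = datetime(2024, 1, 1, 4, 50, 0, tzinfo=timezone.utc)

# Assumption: the caller's cert was issued 5 seconds ago with no backdating,
# and it lives for 1 hour as in the scenario.
not_before = fleet_now - timedelta(seconds=5)
not_after = not_before + timedelta(hours=1)

# Evaluate the same certificate from verifiers whose clocks are offset
# from the fleet reference by -90 s, 0 s, and +90 s.
results = {}
for offset_s in (-90, 0, 90):
    verifier_now = fleet_now + timedelta(seconds=offset_s)
    results[offset_s] = validity_status(verifier_now, not_before, not_after)
    print(f"offset {offset_s:+d}s -> {results[offset_s]}")
```

Under these assumptions only the verifier whose clock trails the fleet reports "not yet valid", so during triage it is worth confirming the sign of the offset against a trusted time source rather than relying on a single dashboard reading.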

