Question
You run mutual TLS between internal services with short-lived (1-hour) certs issued by an internal CA. At 04:50 a single node — node-37 — starts rejecting every inbound mTLS connection with 'certificate is not yet valid' (notBefore in the future), while every other node is fine. Affected callers retry and shed to other nodes, so customer impact is small but error budget is bleeding. Dashboards: node-37's clock shows ~90 seconds ahead of the fleet; NTP status on it reads 'unsynchronized.' The issuing CA and all other nodes are healthy. There was no deploy. How do you triage and mitigate?
Answer
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
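An added triage helper, not part of the original question or answer, for the "form hypotheses from real signals" step: it replays the certificate validity check on the suspect node's clock and on a trusted reference clock (the NTP-synced fleet), and reports whether skew explains the observed error or whether the offset points the wrong way for it. It assumes `openssl x509 -noout -dates` output as the cert source and timezone-aware UTC clock readings; all dates and offsets below are constructed.

```python
# Added example, not part of the original question or answer: replay the cert
# validity check on the suspect node's clock and on a trusted reference clock
# (the NTP-synced fleet). All certificate dates and clock readings are constructed.
import ssl
from datetime import datetime, timedelta, timezone

def parse_openssl_dates(text):
    """Parse the two lines printed by `openssl x509 -noout -dates` into UTC datetimes."""
    dates = {}
    for line in text.strip().splitlines():
        key, value = line.split("=", 1)
        dates[key] = datetime.fromtimestamp(ssl.cert_time_to_seconds(value), tz=timezone.utc)
    return dates["notBefore"], dates["notAfter"]

def cert_verdict(not_before, not_after, now):
    """What a verifier whose wall clock reads `now` concludes about the cert."""
    if now < not_before:
        return "not_yet_valid"
    if now > not_after:
        return "expired"
    return "valid"

def triage_clock_skew(not_before, not_after, local_now, reference_now, observed_error):
    """Return the signed skew (positive = local clock ahead) and a diagnosis."""
    local = cert_verdict(not_before, not_after, local_now)
    reference = cert_verdict(not_before, not_after, reference_now)
    if reference == observed_error:
        diagnosis = "cert_or_issuer"        # a correct clock rejects it too: look at the CA
    elif local == observed_error:
        diagnosis = "local_clock_skew"      # cert is fine; only this node's clock rejects it
    elif local == "valid":
        diagnosis = "not_reproduced"        # neither clock yields the reported error
    else:
        # Skew exists but points the wrong way: a clock running ahead yields
        # 'expired', one running behind yields 'not_yet_valid'. Re-check the offset sign.
        diagnosis = "skew_direction_mismatch"
    return {"skew_seconds": (local_now - reference_now).total_seconds(),
            "local_verdict": local, "reference_verdict": reference, "diagnosis": diagnosis}

# Constructed sample: a 1-hour cert issued at 04:50:00 UTC; the fleet clock reads 04:50:30.
SAMPLE_DATES = "notBefore=Mar  3 04:50:00 2025 GMT\nnotAfter=Mar  3 05:50:00 2025 GMT\n"
NOT_BEFORE, NOT_AFTER = parse_openssl_dates(SAMPLE_DATES)
REFERENCE_NOW = datetime(2025, 3, 3, 4, 50, 30, tzinfo=timezone.utc)

def demo_scenarios(observed_error="not_yet_valid"):
    """Replay the report with the node clock 90 s behind, then 90 s ahead of the fleet."""
    results = {}
    for name, offset in (("behind_90s", -90), ("ahead_90s", 90)):
        results[name] = triage_clock_skew(NOT_BEFORE, NOT_AFTER, REFERENCE_NOW + timedelta(seconds=offset),
                                          REFERENCE_NOW, observed_error)
    return results
```

Run against real input by pasting the node's `openssl x509 -noout -dates` output into `parse_openssl_dates` and its `date -u` reading as `local_now`; a `skew_direction_mismatch` result means the offset sign on the dashboard and the error class disagree and one of them needs re-measuring before remediation.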