On-call · Hard · oc-g493

Subject: Service mesh issue · Level: Senior–Staff · ~40 min · Common in Reliability & on-call interviews · Industries: Technology, Software development

Question

Your mesh rotates per-sidecar mTLS certs via the control plane's secret-discovery service (SDS). Roughly every hour you get a ~20-second burst of mTLS handshake failures ('tls: bad certificate' / 'certificate signed by unknown authority') between certain service pairs, then it self-heals. Dashboards: the bursts align exactly with cert-rotation windows; the control plane is healthy; failures cluster on the newest pods. The root CA was rotated yesterday (a new signing key added to the trust bundle) as part of a planned CA migration. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
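
One hypothesis that fits the signals in the question (failures tied to rotation windows, concentrated on the newest pods, one day after the CA change) is that some sidecars receive leaf certs signed by the new CA before every peer has loaded the updated trust bundle. This is only a hypothesis to test against real data, not a confirmed root cause. The sketch below models it with plain Python; the `Sidecar` type, the pod names, and the `old-ca` / `new-ca` labels are all made up for illustration and do not correspond to any real mesh API.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Sidecar:
    """Hypothetical snapshot of one sidecar's certificate state."""
    name: str
    leaf_issuer: str          # label of the CA that signed the current leaf cert
    trusted_roots: frozenset  # CA labels present in the current trust bundle


def handshake_ok(a: Sidecar, b: Sidecar) -> bool:
    # mTLS is mutual: each side must trust the issuer of the other's leaf.
    return a.leaf_issuer in b.trusted_roots and b.leaf_issuer in a.trusted_roots


def failing_pairs(sidecars):
    # Return every unordered pair that would fail the mutual-trust check.
    return [(a.name, b.name)
            for a, b in combinations(sidecars, 2)
            if not handshake_ok(a, b)]


# Assumed fleet state during a rotation window (illustrative only).
fleet = [
    # Long-running pod that has not yet picked up the new trust bundle.
    Sidecar("old-bundle-pod", "old-ca", frozenset({"old-ca"})),
    # Newest pod: leaf already signed by the new CA, bundle up to date.
    Sidecar("newest-pod", "new-ca", frozenset({"old-ca", "new-ca"})),
    # Pod with the updated bundle whose leaf is still signed by the old CA.
    Sidecar("synced-pod", "old-ca", frozenset({"old-ca", "new-ca"})),
]

print(failing_pairs(fleet))
```

In this toy model only the pair involving the stale bundle and the new-CA leaf fails, which mirrors the "certificate signed by unknown authority" symptom on the newest pods. If real data supported this, a plausible mitigation would be to confirm the new root is in every sidecar's bundle before any leaf is issued from the new CA; if the data showed something else, the hypothesis should be dropped.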
