Code Room
On-callHard
Question
Your mesh rotates per-sidecar mTLS certs via the control plane's secret-discovery service (SDS). Roughly every hour you get a ~20-second burst of mTLS handshake failures ('tls: bad certificate' / 'certificate signed by unknown authority') between certain service pairs, then it self-heals. Dashboards: the bursts align exactly with cert-rotation windows; the control plane is healthy; failures cluster on the newest pods. The root CA was rotated yesterday (a new signing key added to the trust bundle) as part of a planned CA migration. How do you triage and mitigate?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.