Question
Your mesh does automatic mTLS between sidecars, with certs rotated by the control plane's secret-discovery service (SDS). At 12:00 you pushed a routine PeerAuthentication policy change tightening mTLS to STRICT mesh-wide. Over the next 15 minutes, the error rate climbs to 25%, but only between certain service pairs, and only for older pods. Dashboards show: newer pods talk fine; older pods (started before a recent control-plane upgrade) get 'connection terminated' / TLS errors when called by STRICT-mode callers; the control plane is healthy; sidecar proxies on old pods show stale config not yet reconciled. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
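One way to turn the dashboard signals into a concrete list of affected pods is sketched below. It is a minimal illustration, not a definitive tool: the `Pod` record, the `restart_candidates` helper, the pod names, and the timestamps are all hypothetical, and it assumes you can export each pod's start time and whether its sidecar reports config in sync. It covers only the step of identifying pods that match both symptoms (started before the control-plane upgrade, config still stale); rolling back or relaxing the policy is a separate decision.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Pod:
    """Hypothetical record built from whatever inventory export is available."""
    name: str
    started_at: datetime
    config_synced: bool  # True if the sidecar reports its config as reconciled


def restart_candidates(pods, upgrade_time):
    """Return names of pods that started before the upgrade AND still have
    stale sidecar config, oldest first."""
    stale = [
        p for p in pods
        if p.started_at < upgrade_time and not p.config_synced
    ]
    return [p.name for p in sorted(stale, key=lambda p: p.started_at)]


# Illustrative data only: names and times are made up.
UPGRADE = datetime(2024, 1, 10, 9, 0)
PODS = [
    Pod("checkout-7d9", datetime(2024, 1, 8, 14, 0), config_synced=False),
    Pod("cart-5c2", datetime(2024, 1, 5, 11, 30), config_synced=False),
    Pod("search-1a4", datetime(2024, 1, 12, 8, 0), config_synced=True),
    Pod("auth-9f1", datetime(2024, 1, 7, 16, 45), config_synced=True),
]

if __name__ == "__main__":
    for name in restart_candidates(PODS, UPGRADE):
        print(name)
```

Filtering on both conditions, rather than on age alone, keeps old pods that have already reconciled (like the hypothetical `auth-9f1`) out of the restart list, which limits the blast radius of the mitigation.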