On-call | Hard | oc-g300

Subject: Service mesh issue
Level: Senior–Staff
Time: ~45 min
Common in: Reliability & on-call interviews
Industries: Technology

Question

Your mesh does automatic mTLS between sidecars, with certs rotated by the control plane's secret-discovery service (SDS). At 12:00 you pushed a routine PeerAuthentication policy change tightening mTLS to STRICT mesh-wide. Over the next 15 minutes, the error rate climbs to 25%, but only between certain service pairs and only for older pods. The dashboards show: newer pods talk fine; older pods (started before a recent control-plane upgrade) get 'connection terminated' / TLS errors when called by STRICT-mode callers; the control plane is healthy; and sidecar proxies on the old pods show stale config that has not yet been reconciled. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
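
One way to "form hypotheses from real signals" here is to split the error rate by whether the destination pod started before or after the control-plane upgrade. The sketch below is illustrative only: the `Request` record, the pod names, the timestamps, and the sample traffic are all hypothetical, and in a real incident the inputs would come from your metrics and pod metadata rather than hard-coded lists.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Request:
    """Hypothetical per-request record: destination pod and success flag."""
    dest_pod: str
    ok: bool


def error_rate_by_cohort(pod_start, requests, upgrade_time):
    """Return the error rate per cohort ("old" or "new").

    A destination pod is "old" if it started before the control-plane
    upgrade, otherwise "new".
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for req in requests:
        cohort = "old" if pod_start[req.dest_pod] < upgrade_time else "new"
        totals[cohort] += 1
        if not req.ok:
            errors[cohort] += 1
    return {cohort: errors[cohort] / totals[cohort] for cohort in totals}


# Hypothetical sample data (assumed values, not taken from the scenario).
UPGRADE = datetime(2024, 1, 10, 9, 0)
POD_START = {
    "cart-old": datetime(2024, 1, 8, 14, 0),   # started before the upgrade
    "cart-new": datetime(2024, 1, 11, 10, 0),  # started after the upgrade
}
REQUESTS = [
    Request("cart-old", False),
    Request("cart-old", False),
    Request("cart-old", True),
    Request("cart-old", False),
    Request("cart-new", True),
    Request("cart-new", True),
    Request("cart-new", True),
    Request("cart-new", True),
]

rates = error_rate_by_cohort(POD_START, REQUESTS, UPGRADE)
print(rates)
```

If the split shows errors concentrated in the "old" cohort, that supports the stale-sidecar hypothesis over a mesh-wide policy or control-plane fault, and it also tells you which pods a targeted mitigation needs to cover.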


