Question
Your services run with an Envoy sidecar per pod (Istio-style mesh). Starting at 16:40, one service, payments-api, develops a slow bleed of 502s and connection resets, climbing from 0.1% to ~4% over an hour, but only on certain pods and only intermittently. Dashboards: the app container's own metrics look healthy (low CPU, normal app latency, no app errors); the affected pods' sidecar containers show climbing memory and occasional restarts; the mesh control plane is healthy; there was no app deploy, but the platform team pushed a new mesh config (more telemetry/route rules) yesterday. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
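
One way to "form hypotheses from real signals" is to check, pod by pod, whether the 502s line up with sidecar distress rather than with the app. The following is a minimal sketch under stated assumptions: the `PodSample` fields, the function names, and both thresholds are hypothetical and chosen for illustration, and the samples would have to be filled in from whatever metrics source is actually in use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PodSample:
    """Per-pod counters over one observation window (hypothetical shape)."""
    pod: str
    requests: int                 # total requests seen in the window
    errors_502: int               # 502s and connection resets in the window
    sidecar_restarts: int         # sidecar container restarts in the window
    sidecar_mem_mib_start: float  # sidecar memory at window start
    sidecar_mem_mib_end: float    # sidecar memory at window end


def error_rate(sample: PodSample) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if sample.requests == 0:
        return 0.0
    return sample.errors_502 / sample.requests


def suspect_pods(
    samples: List[PodSample],
    rate_threshold: float = 0.01,        # assumed: 1% error rate
    mem_growth_threshold: float = 0.25,  # assumed: 25% memory growth
) -> List[str]:
    """Return pods that show BOTH elevated errors and sidecar distress.

    Sidecar distress here means at least one restart or memory growth
    at or above mem_growth_threshold within the window.
    """
    suspects = []
    for s in samples:
        if s.sidecar_mem_mib_start > 0:
            growth = (
                s.sidecar_mem_mib_end - s.sidecar_mem_mib_start
            ) / s.sidecar_mem_mib_start
        else:
            growth = 0.0
        sidecar_unhealthy = (
            s.sidecar_restarts > 0 or growth >= mem_growth_threshold
        )
        if error_rate(s) >= rate_threshold and sidecar_unhealthy:
            suspects.append(s.pod)
    return sorted(suspects)
```

If most erroring pods land in the suspect list, that supports the sidecar (and the recent mesh config push) as the place to look. Erroring pods with a healthy sidecar would point elsewhere and weaken that hypothesis.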