Code Room
On-call · Hard · oc-g292
Subject: Service mesh issue · Level: Senior–Staff · ~45 min · Common in: Reliability & on-call interviews · Industries: Technology

Question

Your services run with an Envoy sidecar per pod (Istio-style mesh). Starting at 16:40, one service, payments-api, develops a slow bleed of 502s and connection resets, climbing from 0.1% to ~4% over an hour, but only on certain pods and only intermittently. The dashboards show that the app container's own metrics look healthy (low CPU, normal app latency, no app errors), that the affected pods' sidecar containers have climbing memory and occasional restarts, and that the mesh control plane is healthy. There was no app deploy, but the platform team pushed a new mesh config (more telemetry/route rules) yesterday. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
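One way to "form hypotheses from real signals" is to check whether the errors actually track the sidecar symptoms pod by pod before blaming anything. The sketch below is only an illustration under stated assumptions: the `PodSample` type, the pod names, the thresholds, and every number in it are hypothetical stand-ins for whatever your metrics system exports, not data from this incident.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PodSample:
    """One pod's signals over the incident window (all values hypothetical)."""
    pod: str
    requests: int
    errors_5xx: int              # 502s and resets as seen by callers
    sidecar_mem_mb: List[float]  # sidecar memory samples, oldest first
    sidecar_restarts: int


def error_rate(s: PodSample) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return s.errors_5xx / s.requests if s.requests else 0.0


def memory_climbing(samples: List[float], min_growth: float = 1.5) -> bool:
    """True if the last sample is at least min_growth times the first."""
    return (
        len(samples) >= 2
        and samples[0] > 0
        and samples[-1] >= min_growth * samples[0]
    )


def suspects(pods: List[PodSample], rate_threshold: float = 0.01) -> List[str]:
    """Pods where elevated errors coincide with sidecar memory growth or restarts,
    ordered from highest to lowest error rate."""
    flagged = [
        s for s in pods
        if error_rate(s) >= rate_threshold
        and (memory_climbing(s.sidecar_mem_mb) or s.sidecar_restarts > 0)
    ]
    return [s.pod for s in sorted(flagged, key=error_rate, reverse=True)]


# Hypothetical snapshot: two unhealthy pods and one healthy one.
pods = [
    PodSample("payments-api-a", 10_000, 620, [180, 260, 410], 2),
    PodSample("payments-api-b", 10_000, 8, [175, 178, 181], 0),
    PodSample("payments-api-c", 10_000, 410, [190, 300, 395], 1),
]
print(suspects(pods))  # ['payments-api-a', 'payments-api-c']
```

If the flagged pods are exactly the ones with climbing sidecar memory or restarts while the app container stays healthy, that supports treating the errors as the symptom and the proxy as the place to look, with yesterday's mesh config push as the leading hypothesis to test. The same list also gives a concrete target set for a first mitigation while that hypothesis is checked.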

Diagram & narrate the incident