On-call · Hard · oc-g489

Subject: Service mesh issue
Level: Senior–Staff
Time: ~40 min
Common in: Reliability & on-call interviews
Industries: Technology, Software development

Question

Your pods run an Envoy sidecar (Istio-style). Starting at 18:30, one service (orders-api) shows a steady ~6% of requests failing with connection resets ('upstream connect error or disconnect/reset before headers'), but only on a rotating subset of pods, and the failures come in short bursts. Dashboards: the app container's own latency and error metrics are clean; the sidecar's memory creeps up to its limit and then drops, and `kubectl get pod` shows the istio-proxy container's restart count quietly incrementing (the pod itself isn't restarting, just the sidecar container). Traffic to orders-api is up ~30% this week after a marketing push. There has been no app deploy. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
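
As one illustration of forming a hypothesis from real signals: sidecar memory climbing to its limit while the istio-proxy restart count increments is consistent with the sidecar being OOM-killed, though the question does not state that outright. Below is a minimal Python sketch for checking it, assuming pod status has been exported as JSON (for example with `kubectl get pods -l app=orders-api -o json`). The `app=orders-api` label, the pod names, and the sample data are hypothetical, and `sidecar_restarts` is a helper invented for this sketch. It reads only the standard pod status fields `containerStatuses[].restartCount` and `lastState.terminated.reason`.

```python
import json

SIDECAR = "istio-proxy"  # sidecar container name, as given in the question


def sidecar_restarts(pod_list: dict, container: str = SIDECAR) -> list[dict]:
    """Return one record per pod whose sidecar container has restarted,
    sorted by restart count (highest first)."""
    findings = []
    for pod in pod_list.get("items", []):
        for cs in pod.get("status", {}).get("containerStatuses", []):
            # Skip the app container and sidecars that never restarted.
            if cs.get("name") != container or cs.get("restartCount", 0) == 0:
                continue
            terminated = cs.get("lastState", {}).get("terminated", {})
            findings.append({
                "pod": pod["metadata"]["name"],
                "restarts": cs["restartCount"],
                "reason": terminated.get("reason", "unknown"),
                "finished_at": terminated.get("finishedAt"),
            })
    return sorted(findings, key=lambda f: f["restarts"], reverse=True)


# Hypothetical sample shaped like the output of `kubectl get pods -o json`.
SAMPLE = json.loads("""
{
  "items": [
    {"metadata": {"name": "orders-api-a"},
     "status": {"containerStatuses": [
       {"name": "orders-api", "restartCount": 0, "lastState": {}},
       {"name": "istio-proxy", "restartCount": 3,
        "lastState": {"terminated": {"reason": "OOMKilled",
                                     "finishedAt": "2024-01-01T18:41:07Z"}}}]}},
    {"metadata": {"name": "orders-api-b"},
     "status": {"containerStatuses": [
       {"name": "orders-api", "restartCount": 0, "lastState": {}},
       {"name": "istio-proxy", "restartCount": 0, "lastState": {}}]}},
    {"metadata": {"name": "orders-api-c"},
     "status": {"containerStatuses": [
       {"name": "orders-api", "restartCount": 0, "lastState": {}},
       {"name": "istio-proxy", "restartCount": 1,
        "lastState": {"terminated": {"reason": "OOMKilled",
                                     "finishedAt": "2024-01-01T18:52:30Z"}}}]}}
  ]
}
""")

if __name__ == "__main__":
    for f in sidecar_restarts(SAMPLE):
        print(f'{f["pod"]}: {f["restarts"]} restarts, last reason {f["reason"]}')
```

If most restarts report `OOMKilled`, raising the sidecar's memory limit is a plausible first mitigation while the root cause (for instance, growth in connections under the extra traffic) is investigated. Any other termination reason would point the triage elsewhere.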

