Code Room
On-call · Hard
Question
After enrolling a service into the service mesh (Envoy sidecar injected), end-to-end p99 latency rose by ~40 ms, and a small but steady fraction of requests now fail with 503 'upstream connect error', even though the application itself logs success and is healthy. Both the 503s and the added latency correlate with high request concurrency. App CPU is fine, but the sidecar container is hitting its CPU limit, and Envoy's circuit-breaker / pending-request-overflow stats are incrementing. Nothing changed in the application or downstream; the only change was the mesh enrollment. How do you triage and mitigate?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
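One way to ground "real signals" is to read the two counters the scenario already names: Envoy's overflow stats and the sidecar's CPU throttling. The sketch below is a minimal illustration, not a prescribed method. It assumes you have already captured the text output of Envoy's admin `/stats` endpoint and the contents of the sidecar container's cgroup `cpu.stat` file; the function names `overflow_counters` and `throttle_ratio` are hypothetical helpers invented for this example.

```python
import re

# Envoy cluster counters that increment when a circuit-breaker limit rejects work.
OVERFLOW_STATS = (
    "upstream_rq_pending_overflow",
    "upstream_cx_overflow",
    "upstream_rq_retry_overflow",
)

# Matches plain counter lines such as "cluster.<name>.<stat>: <integer>".
_STAT_LINE = re.compile(
    r"^cluster\.(?P<cluster>.+)\.(?P<stat>[^.:\s]+):\s*(?P<value>\d+)\s*$"
)


def overflow_counters(stats_text):
    """Return {cluster: {stat: value}} for every non-zero overflow counter."""
    hits = {}
    for line in stats_text.splitlines():
        match = _STAT_LINE.match(line.strip())
        if not match or match.group("stat") not in OVERFLOW_STATS:
            continue
        value = int(match.group("value"))
        if value > 0:
            hits.setdefault(match.group("cluster"), {})[match.group("stat")] = value
    return hits


def throttle_ratio(cpu_stat_text):
    """Fraction of CFS periods in which the container was throttled (0.0 if unknown)."""
    fields = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            fields[parts[0]] = int(parts[1])
    periods = fields.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return fields.get("nr_throttled", 0) / periods
```

Non-zero overflow counters alongside a high throttle ratio on the sidecar would support the hypothesis that the proxy, not the application, is the bottleneck under concurrency. Where the admin endpoint listens and where `cpu.stat` lives depend on the mesh and the cgroup version, so treat those locations as things to confirm in your environment.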