Question
Clients of an internal gRPC service start seeing a storm of 'connection reset by peer' errors — error rate jumps from ~0 to ~8% — beginning right after a deploy. Captures show the server sending TCP RST packets mid-stream on long-lived connections. The service is healthy by its own metrics (no crashes, normal latency on the requests that do complete), and the error is bursty: clusters of resets every few minutes rather than a steady rate. The deploy added a sidecar proxy in front of each pod and set an idle-connection timeout on it. Connection counts per pod are higher than before. Triage and fix.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.