On-callMediumoc-g679

Subject Networking tcpLevel Mid–Senior~30 minCommon in Networking & APIs interviewsIndustries Technology

Question

Clients of an internal gRPC service start seeing a storm of 'connection reset by peer' errors — error rate jumps from ~0 to ~8% — beginning right after a deploy. Captures show the server sending TCP RST packets mid-stream on long-lived connections. The service is healthy by its own metrics (no crashes, normal latency on the requests that do complete), and the error is bursty: clusters of resets every few minutes rather than a steady rate. The deploy added a sidecar proxy in front of each pod and set an idle-connection timeout on it. Connection counts per pod are higher than before. Triage and fix.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.