Question
Every rolling deploy of a service causes a brief but sharp burst of client-facing 5xx errors and 'connection reset' / 'upstream closed connection' errors at the load balancer — lasting 10-20 seconds per batch of pods replaced — even though the new pods are healthy and the steady state is clean. The errors line up exactly with old pods terminating. The deployment replaces pods in batches; the LB health check passes for new pods quickly. The app receives SIGTERM and exits in ~1s. There's no connection draining configured and no preStop hook. Triage and explain the durable fix.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.