Question
A JVM service deploys with a rolling update. Each time a new pod comes up, there's a 20–30 second window where requests routed to it return elevated 503s and very high latency, then that pod normalizes. Across the full rollout these windows stack up into a noticeable error-rate bump and a few breached SLOs. Dashboards: per-new-pod, the first ~25s show high latency + 503s, then flat; the readiness probe is a TCP check on the port; the pod accepts the port (and thus traffic) immediately at process start, before JIT warmup / connection-pool init / first-request class loading complete. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.