Question
You ship a routine deploy of your JVM service at evening peak using a rolling update with maxSurge=50%, maxUnavailable=0. Within a minute p99 spikes from 120ms to 5s, error rate hits 8%, and upstream circuit breakers trip. CPU on the newly rolled pods is pinned at 100% while old pods are fine. The deploy keeps progressing batch by batch, and each new batch produces a fresh latency spike — a rolling wave. Cold pods take ~90s to JIT-compile hot paths and warm connection pools before they perform normally; meanwhile load balancers send them a full share of peak traffic immediately. Last week's off-peak deploy was clean. How do you triage and stabilize?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.