Question
After a coordinated full restart of your TLS-terminating edge fleet (a fast rolling restart that cycled almost all nodes within a couple minutes), you see a few minutes of elevated p99, raised CPU on the edge nodes, and a spike of client handshake errors/timeouts — even though traffic volume is unchanged. Metrics show TLS session-resumption RATE collapsed to near zero right after the restart and then climbed back over ~5 minutes, exactly tracking the error/CPU spike as it subsided. New full handshakes per second jumped enormously during the window. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.