Question
Your payments API has seen p99 latency creep from 80ms to 1.4s over the last hour, with no change in request rate or CPU. The service itself reports its own handler time is still ~70ms; the latency is between the load balancer and the upstream pods. Network dashboards for one of three availability zones show TCP retransmits climbing from ~0.01% to 3.2% of segments, and `netstat -s` on the affected hosts shows a rising count of "segments retransmitted" and "fast retransmits." Throughput per connection has collapsed. No deploy went out; a ToR (top-of-rack) switch in that AZ was replaced during a maintenance window two hours ago. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.