On-callMediumoc-g675

Subject Networking tcpLevel Mid–Senior~30 minCommon in Networking & APIs interviewsIndustries Technology

Question

Your payments API has seen p99 latency creep from 80ms to 1.4s over the last hour, with no change in request rate or CPU. The service itself reports its own handler time is still ~70ms; the latency is between the load balancer and the upstream pods. Network dashboards for one of three availability zones show TCP retransmits climbing from ~0.01% to 3.2% of segments, and `netstat -s` on the affected hosts shows a rising count of "segments retransmitted" and "fast retransmits." Throughput per connection has collapsed. No deploy went out; a ToR (top-of-rack) switch in that AZ was replaced during a maintenance window two hours ago. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.