On-callMediumoc-g484

Subject Dependency upgradeLevel Mid–Senior~35 minCommon in Networking & APIs · Code quality & review interviewsIndustries Technology, Software development

Question

A deploy bumps the HTTP client library (minor version) used to call a downstream payments gateway. Tests pass. After deploy, p99 latency on the gateway call doubles and the number of TCP connections the service opens to the gateway jumps ~10x, tripping the gateway's per-client connection limit so a small fraction of calls get refused at peak. Dashboards: the gateway itself reports normal processing latency and is under-utilized; the spike is all in connection setup (TLS handshakes) on our side; CPU is up on our pods from the extra handshakes. The library's minor changelog mentions 'connection pooling is now opt-in for safety.' Triage, explain the mechanism, and mitigate.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.