Question
A deploy bumps the HTTP client library (minor version) used to call a downstream payments gateway. Tests pass. After deploy, p99 latency on the gateway call doubles and the number of TCP connections the service opens to the gateway jumps ~10x, tripping the gateway's per-client connection limit so a small fraction of calls get refused at peak. Dashboards: the gateway itself reports normal processing latency and is under-utilized; the spike is all in connection setup (TLS handshakes) on our side; CPU is up on our pods from the extra handshakes. The library's minor changelog mentions 'connection pooling is now opt-in for safety.' Triage, explain the mechanism, and mitigate.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.