Code Room
On-call · Hard
Question
An API gateway pod begins throwing intermittent 'cannot assign requested address' connect failures to a backend after a refactor that 'simplified' the HTTP client by creating a fresh client per outbound call instead of reusing a pooled one. Errors climb with traffic, disappear after a pod restart, then return. CPU and memory are normal. `ss`/`netstat` shows tens of thousands of sockets in TIME_WAIT to the backend's IP:port, and the count tracks request volume. The backend has not changed. How do you triage and mitigate?
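To make the TIME_WAIT signal concrete, here is a minimal Python sketch of the triage arithmetic. It assumes `ss -tan` style output (state in the first column, peer address in the fifth), the usual Linux default ephemeral port range of 32768 to 60999, and a 60-second TIME_WAIT hold. The function names are hypothetical, and settings such as `tcp_tw_reuse` can change the real ceiling, so treat the result as a rough estimate.

```python
from collections import Counter

# Assumption: Linux keeps a closed socket in TIME_WAIT for 60 seconds.
TIME_WAIT_SECONDS = 60


def count_time_wait(ss_output):
    """Count TIME-WAIT sockets per peer address in `ss -tan` style text."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Expected columns: State Recv-Q Send-Q Local:Port Peer:Port
        if len(fields) >= 5 and fields[0] == "TIME-WAIT":
            counts[fields[4]] += 1
    return counts


def ephemeral_port_count(port_range_text):
    """Size of the range read from /proc/sys/net/ipv4/ip_local_port_range."""
    low, high = (int(part) for part in port_range_text.split())
    return high - low + 1


def max_new_connections_per_second(port_count):
    """Rough ceiling on new connections per second to ONE backend IP:port.

    Source IP, destination IP, and destination port are fixed, so only the
    source port varies, and each port stays unusable while in TIME_WAIT.
    """
    return port_count / TIME_WAIT_SECONDS
```

With the default range this gives 28232 usable ports, or roughly 470 new connections per second to a single backend endpoint before connects start to fail. That matches a failure that tracks request volume and clears on restart.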
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
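As an illustration of the root cause versus the fix, the following Python sketch contrasts a fresh connection per call with one reused keep-alive connection against a local stand-in backend. Everything here is hypothetical scaffolding (the handler, the helper names, the loopback server). It uses only the standard library and is not the gateway's actual client; it simply shows that the per-call pattern consumes a new source port each time while the reused connection consumes one.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class RecordingHandler(BaseHTTPRequestHandler):
    """Stand-in backend that records the source port of every request."""

    protocol_version = "HTTP/1.1"  # allow keep-alive connections
    seen_ports = set()

    def do_GET(self):
        type(self).seen_ports.add(self.client_address[1])
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def start_backend():
    """Start the stand-in backend on a free loopback port."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), RecordingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call_with_fresh_connections(port, n):
    """The refactored pattern: a new TCP connection for every call."""
    for _ in range(n):
        conn = http.client.HTTPConnection("127.0.0.1", port)
        conn.request("GET", "/")
        conn.getresponse().read()
        conn.close()  # client closes first, so its port enters TIME_WAIT


def call_with_shared_connection(port, n):
    """The original pattern: one keep-alive connection reused for all calls."""
    conn = http.client.HTTPConnection("127.0.0.1", port)
    for _ in range(n):
        conn.request("GET", "/")
        conn.getresponse().read()  # drain the body before the next request
    conn.close()
```

In a real incident the fastest mitigation is usually to roll back the refactor or restore the shared pooled client. Kernel tuning such as widening the port range only buys headroom and leaves the per-call connection churn in place.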