On-call · Hard · oc-g250

Subject: Saturation
Level: Senior–Staff
~35 min
Common in Networking & APIs interviews
Industries: Technology, Software development

Question

An API gateway pod begins throwing intermittent 'cannot assign requested address' connect failures to a backend after a refactor that 'simplified' the HTTP client by creating a fresh client per outbound call instead of reusing a pooled one. Errors climb with traffic, disappear after a pod restart, then return. CPU and memory are fine. `ss`/`netstat` show tens of thousands of sockets in TIME_WAIT to the backend's IP:port, and the count tracks request volume. Nothing has changed on the backend. How do you triage and mitigate?
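
The numbers in the scenario can be sanity-checked with a back-of-the-envelope port budget. The sketch below assumes Linux defaults that the question does not state: an ephemeral port range of 32768 to 60999 (`net.ipv4.ip_local_port_range`), a 60-second TIME_WAIT, a single source IP, a single backend IP:port, and the pod being the side that closes each connection. The function name is hypothetical.

```python
def max_new_connections_per_second(port_low=32768, port_high=60999,
                                   time_wait_seconds=60):
    """Rough ceiling on new outbound connections per second to ONE
    destination IP:port before source ports run out.

    Assumptions (not from the scenario): Linux default port range,
    60 s TIME_WAIT, one source IP, client closes first, and no help
    from settings such as tcp_tw_reuse.
    """
    # Each closed connection pins its source port for the TIME_WAIT period.
    usable_ports = port_high - port_low + 1
    return usable_ports / time_wait_seconds


if __name__ == "__main__":
    rate = max_new_connections_per_second()
    print(f"about {rate:.0f} new connections/s to a single backend")
```

Under these assumptions the ceiling is roughly 470 new connections per second, which would explain why the errors track traffic and why a restart only buys time until TIME_WAIT sockets pile up again.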

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
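
As one way to make the root cause concrete, the sketch below contrasts a fresh connection per call with a single reused keep-alive connection, and counts how many distinct source ports the server sees. It uses only the Python standard library against a local test server; the gateway's real HTTP client is not named in the scenario, so `Handler` and `count_source_ports` are hypothetical names and this is an illustration of the pattern, not the actual fix.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # lets the server keep connections alive
    seen_ports = set()             # client source ports observed by the server

    def do_GET(self):
        Handler.seen_ports.add(self.client_address[1])
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def count_source_ports(reuse, requests=20):
    """Send `requests` GETs and return how many distinct client source
    ports were used. reuse=True shares one connection for every call;
    reuse=False opens and closes a new connection per call."""
    Handler.seen_ports = set()
    server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    shared = http.client.HTTPConnection(host, port) if reuse else None
    try:
        for _ in range(requests):
            conn = shared if reuse else http.client.HTTPConnection(host, port)
            conn.request("GET", "/")
            conn.getresponse().read()  # drain the body so the socket is reusable
            if not reuse:
                conn.close()  # client closes first, so TIME_WAIT lands on the client
    finally:
        if shared is not None:
            shared.close()
        server.shutdown()
        server.server_close()
    return len(Handler.seen_ports)


if __name__ == "__main__":
    print("fresh connection per call:", count_source_ports(reuse=False))
    print("one reused connection:   ", count_source_ports(reuse=True))
```

The reused connection consumes one source port regardless of request count, while the per-call variant consumes a new one each time. That is the argument for rolling back the refactor (or restoring the pooled client) as the mitigation, with kernel tuning treated as a secondary measure at most.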
