On-call · Hard · oc-g409

Subject: Ephemeral port exhaustion · Level: Senior–Staff · ~35 min · Common in Networking & APIs interviews · Industries: Technology

Question

A service mesh sidecar fleet starts intermittently failing outbound calls at peak with 'cannot assign requested address', and the failures cluster on the busiest pods. `ss -s` on a hot pod shows tens of thousands of sockets to a handful of upstreams, mostly in TIME_WAIT, and the node's conntrack table is also near its maximum. Traffic is up because a marketing event doubled load this afternoon, and a recent mesh config change reduced the upstream connection idle timeout, so connections now close and reopen far more aggressively than before. CPU, memory, and file descriptor usage are fine. Triage and mitigate.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
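
One way to turn the signals into a hypothesis is a back-of-the-envelope port budget. The sketch below is a rough model, not a measurement. It assumes the Linux default `net.ipv4.ip_local_port_range` of 32768 to 60999 and a 60-second TIME_WAIT, and it treats the budget as applying per upstream endpoint (destination IP and port). The function names and the close rates of 300 and 600 per second are hypothetical, chosen only to illustrate how a doubled churn rate can cross the limit. It also ignores ports held by ESTABLISHED sockets and any effect of `net.ipv4.tcp_tw_reuse`.

```python
# Assumed Linux defaults; read the real values on the node before trusting the result.
DEFAULT_PORT_RANGE = (32768, 60999)  # net.ipv4.ip_local_port_range
TIME_WAIT_SECONDS = 60               # how long Linux keeps a socket in TIME_WAIT


def ephemeral_ports(port_range=DEFAULT_PORT_RANGE):
    """Number of source ports available toward one upstream endpoint."""
    low, high = port_range
    return high - low + 1


def steady_state_time_wait(closes_per_second, time_wait_seconds=TIME_WAIT_SECONDS):
    """Sockets parked in TIME_WAIT if the pod actively closes at a constant rate."""
    return closes_per_second * time_wait_seconds


def is_exhausted(closes_per_second, port_range=DEFAULT_PORT_RANGE,
                 time_wait_seconds=TIME_WAIT_SECONDS):
    """True when TIME_WAIT alone would consume every ephemeral port."""
    parked = steady_state_time_wait(closes_per_second, time_wait_seconds)
    return parked >= ephemeral_ports(port_range)


if __name__ == "__main__":
    # Hypothetical per-endpoint close rates before and after the event.
    for rate in (300, 600):
        parked = steady_state_time_wait(rate)
        print(rate, parked, ephemeral_ports(), is_exhausted(rate))
```

If the real numbers land near this threshold, the estimate supports mitigations that cut connection churn, such as restoring the previous idle timeout so connections are reused, ahead of tuning kernel parameters.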
