Question
A Go payments-orchestration service slowly climbs toward its 65,536 fd soft limit over ~30 hours and then starts failing outbound calls with 'dial tcp: too many open files', forcing an hourly restart that masked the issue until this week. The open-fd graph rises in a near-perfect straight line that is *independent of traffic* — it keeps climbing at the same slope even during the overnight trough. `lsof` shows thousands of sockets in CLOSE_WAIT to one internal dependency, and a heap profile shows a growing number of live `http.Transport` objects. The only recent change is a refactor two weeks ago that moved HTTP calls into a per-request helper. How do you triage this and stop the bleed?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.