Question
A Go API gateway begins returning sporadic 'dial tcp: too many open files' errors at 11:30, climbing to ~5% of requests by noon. The fd soft limit on the host is 64k. Your dashboard shows the process's open file descriptors climbing linearly all morning (now 61k and rising), with goroutine count climbing in lockstep. `ss -s` shows tens of thousands of outbound connections to a downstream recommendation service sitting in CLOSE_WAIT. The recommendation service was redeployed at 09:00 behind a new load balancer, and latency to it rose from 20ms to 350ms after that deploy. No change shipped to the gateway. Describe your triage and the fix.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.