Question
A streaming-upload service (Go) that proxies large client uploads to object storage starts showing badly inflated p99 latency on its small, fast control-plane API endpoints — the lightweight `/status` and `/auth` calls that share the same hosts. The big upload streams are healthy and throughput is high; it's the tiny requests that are now slow, with multi-second delays. There was no traffic spike. A 'performance' change last sprint cranked the socket send buffers way up (`SO_SNDBUF`) and increased an internal write-queue size to 'smooth out' uploads. NIC utilization is moderate; CPU and memory are fine. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.