Question
A latency-sensitive API on a cloud VM instance type develops slow, creeping degradation: it runs perfectly after any restart, then over ~30-60 minutes p99 and packet-level latency climb and connection timeouts appear, and a restart (or moving to a fresh instance) resets it. The degradation correlates with sustained outbound throughput from a colocated batch job that ships large datasets out of the same instance. CPU, memory, and disk are all fine throughout. The instance type advertises a 'baseline' network bandwidth with a burstable allowance. How do you triage and what's happening?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.