Question
A latency-sensitive Go pricing service runs 4 pods per node on a shared Kubernetes cluster. p99 on the service shows random 50-150ms degradations that come and go in 5-20 minute windows, uncorrelated with your own traffic, GC, or any deploy. During the bad windows your own CPU usage, request rate, and allocation rate are all flat and normal. Node-level CPU is at ~65%. You eventually notice the bad windows line up exactly with a colocated batch-analytics pod (a different team's job) scheduled onto the same node running a memory-scan-heavy workload. Both pods have CPU requests/limits set and neither is being CFS-throttled. How do you triage and what's going on?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.