On-call · Hard · oc-g442

Subject: Noisy neighbor · Level: Senior–Staff · ~40 min · Common in: Concurrency interviews · Industries: Technology, Software development

Question

A latency-sensitive Go pricing service runs 4 pods per node on a shared Kubernetes cluster. The service's p99 latency shows seemingly random 50-150 ms degradations that come and go in 5-20 minute windows, uncorrelated with your own traffic, GC, or any deploy. During the bad windows, your own CPU usage, request rate, and allocation rate are all flat and normal. Node-level CPU is at ~65%. You eventually notice that the bad windows line up exactly with a colocated batch-analytics pod (a different team's job) that is scheduled onto the same node and runs a memory-scan-heavy workload. Both pods have CPU requests/limits set, and neither is being CFS-throttled. How do you triage, and what's going on?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
