Code Room
On-call · Hard
Question
A backend service's p50 latency is normal but p99.9 has tripled over two days; a small set of RPCs hang for ~30s then succeed on retry. Dashboards: aggregate error rate is ~0%; one host shows TCP retransmits climbing and NIC RX errors trickling up; that host passes its health check and CPU/memory look fine; the load balancer keeps sending it traffic. The slowness is intermittent and only on requests that happen to route through that host. No deploy. How do you triage and mitigate?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
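One way to turn "form hypotheses from real signals" into something concrete is to break tail latency down by host, because the aggregate error rate and the health check both hide a single bad machine. The sketch below is illustrative only: it assumes you can export per-request latencies keyed by host, and the function names, the 3x threshold, and the input shape are all hypothetical rather than part of any particular monitoring tool.

```python
import math
from statistics import median


def percentile(samples, q):
    """Nearest-rank percentile of a non-empty list; q is in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def find_outlier_hosts(latencies_by_host, q=99.9, factor=3.0):
    """Return hosts whose q-th percentile latency exceeds `factor` times
    the fleet median of that same percentile.

    latencies_by_host: dict of host name -> list of latencies in seconds
    (a hypothetical export format, assumed for this sketch).
    """
    tails = {
        host: percentile(samples, q)
        for host, samples in latencies_by_host.items()
        if samples  # skip hosts with no traffic in the window
    }
    if not tails:
        return []
    baseline = median(tails.values())
    return sorted(host for host, tail in tails.items() if tail > factor * baseline)
```

A host flagged this way is a candidate to drain from the load balancer while you investigate the NIC, cable, or switch port. Comparing against the fleet median rather than a fixed threshold keeps the check meaningful when overall latency shifts, though with very few hosts a single outlier can still skew the baseline.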