On-call · Medium · oc-g457

Subject: Slow dependency
Level: Mid–Senior
Time: ~35 min
Common in: Code quality & review interviews
Industries: Technology, Software development

Question

A service's p99 latency develops a regular sawtooth: a small bump every ~10 seconds, on an exactly fixed cadence, plus occasional flapping where the pod briefly leaves and re-enters the load-balancer rotation. CPU and memory look fine, and no real traffic pattern has a 10-second period. You discover that the Kubernetes readiness/liveness probes call an internal `/health` endpoint every 10 seconds, and that `/health` does a 'deep' check: it runs a real query against the primary database and pings two downstream dependencies synchronously, on the same worker thread pool that serves live requests. One of those downstreams is sometimes slow. How do you triage and fix?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
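One common direction for the longer-term fix is to take the deep checks off the request path: run them on a small dedicated pool with a hard timeout, cache the outcome, and let the probe handlers read only the cached state. The sketch below illustrates that idea using only the Python standard library. The class name `CachedHealth`, its parameters, and the choice to fail readiness after a number of consecutive failures are all assumptions made for illustration, not part of the original question or any particular framework.

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict


class CachedHealth:
    """Hypothetical helper: deep checks run in the background, probes read a cache."""

    def __init__(self, checks: Dict[str, Callable[[], bool]],
                 timeout_s: float = 1.0, failure_threshold: int = 3) -> None:
        self._checks = checks
        self._timeout_s = timeout_s
        self._failure_threshold = failure_threshold
        # Dedicated pool: a slow dependency can tie up these threads only,
        # never the workers that serve live requests.
        self._pool = ThreadPoolExecutor(max_workers=max(1, len(checks)))
        self._inflight: Dict[str, Future] = {}
        self._failures = {name: 0 for name in checks}
        self._lock = threading.Lock()

    def refresh(self) -> None:
        """Run one round of deep checks, bounded by timeout_s overall."""
        deadline = time.monotonic() + self._timeout_s
        for name, fn in self._checks.items():
            prev = self._inflight.get(name)
            # Keep at most one call in flight per check, so a hung dependency
            # cannot queue up work behind itself. Note that a late result from
            # an earlier round may be read as this round's result.
            if prev is None or prev.done():
                self._inflight[name] = self._pool.submit(fn)
        for name, fut in self._inflight.items():
            remaining = max(0.0, deadline - time.monotonic())
            try:
                ok = bool(fut.result(timeout=remaining))
            except Exception:  # timeout or an error raised by the check
                ok = False
            with self._lock:
                self._failures[name] = 0 if ok else self._failures[name] + 1

    def start(self, interval_s: float = 10.0) -> None:
        """Refresh forever on a daemon thread, independent of probe traffic."""
        def loop() -> None:
            while True:
                self.refresh()
                time.sleep(interval_s)
        threading.Thread(target=loop, daemon=True).start()

    def is_live(self) -> bool:
        # Liveness reflects only this process; a slow downstream should not
        # cause the pod to be restarted.
        return True

    def is_ready(self) -> bool:
        # No I/O here: the probe handler only reads cached counters.
        with self._lock:
            return all(n < self._failure_threshold for n in self._failures.values())
```

In this sketch a `/health` handler would return `is_ready()` for the readiness probe and `is_live()` for the liveness probe, so probe latency no longer depends on the slow downstream. Whether readiness should depend on downstream health at all is a judgment call; requiring several consecutive failures is one way to damp the flapping described in the question.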
