Question
A service's p99 develops a regular sawtooth: a small latency bump every ~10 seconds, exactly on a fixed cadence, plus occasional flapping where the pod briefly leaves and re-enters the load-balancer rotation. CPU/mem fine, no real traffic pattern at 10s. You discover the Kubernetes readiness/liveness probe and an internal `/health` endpoint run every 10 seconds — and that `/health` does a 'deep' check that runs a real query against the primary database and pings two downstream dependencies synchronously, using the same worker thread pool as live requests. One of those downstreams is sometimes slow. How do you triage and fix?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.