Code Room
On-call · Medium
Question
Your search API calls an upstream pricing service. At 16:00 the pricing service has a localized slowdown: its p99 latency rises from 80 ms to 1.5 s for about 15 minutes. During that window your search API, which also serves many requests that don't need pricing at all, sees its overall p99 blow up, and timeouts spread to unrelated endpoints. Dashboards show that your service's HTTP connection pool to pricing is exhausted and thread pool utilization is at 100%; pricing-service errors are modest, and the dominant symptom is slowness. There was no deploy. How do you triage and mitigate?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
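To make "what prevents a repeat" concrete, one common option is a bulkhead around the pricing call: cap how many requests may wait on pricing at once, give each a tight timeout, and return a fallback when either limit is hit. The sketch below is only an illustration, assuming an asyncio-based service. `Bulkhead`, `slow_pricing`, and the limits (2 concurrent calls, 50 ms timeout) are hypothetical names and numbers, not part of the scenario.

```python
import asyncio


class Bulkhead:
    """Caps concurrent calls to one dependency and bounds how long each may take.

    Hypothetical helper for illustration; not taken from any library.
    """

    def __init__(self, max_concurrent: int, timeout_s: float):
        self._max = max_concurrent
        self._timeout_s = timeout_s
        self._in_flight = 0
        self.rejected = 0   # calls shed because the bulkhead was full
        self.timed_out = 0  # calls abandoned after the timeout

    async def call(self, fn, fallback):
        # Shed load immediately instead of queueing behind a slow dependency.
        if self._in_flight >= self._max:
            self.rejected += 1
            return fallback
        self._in_flight += 1
        try:
            return await asyncio.wait_for(fn(), timeout=self._timeout_s)
        except asyncio.TimeoutError:
            self.timed_out += 1
            return fallback
        finally:
            self._in_flight -= 1


async def slow_pricing():
    # Stands in for the degraded upstream (assumed 500 ms per call).
    await asyncio.sleep(0.5)
    return {"price": 100}


async def demo():
    bulkhead = Bulkhead(max_concurrent=2, timeout_s=0.05)
    results = await asyncio.gather(
        *(bulkhead.call(slow_pricing, fallback=None) for _ in range(5))
    )
    return results, bulkhead.rejected, bulkhead.timed_out


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With these assumed limits, only two of the five calls ever wait on pricing and both give up after 50 ms; the other three are rejected at once. The point of the design is that a slow dependency can hold at most a fixed, small share of the service's capacity, so endpoints that never touch pricing keep their threads and connections. The `rejected` and `timed_out` counters are the kind of signal worth exporting to a dashboard and alerting on.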