Code Room
On-call · Hard
Question
After migrating your service-to-service calls from HTTP/1.1 (with a connection pool) to a single multiplexed HTTP/2 connection per upstream 'for efficiency,' your p99 latency to a critical upstream got worse under load, not better, and the regression correlates with one specific slow endpoint on that upstream. Fast endpoints on the same upstream now also show elevated p99 whenever the slow endpoint is being hit hard. TCP retransmits are slightly up on the path. Nothing changed on the upstream itself. How do you triage and mitigate?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
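One hypothesis consistent with these signals is that the fast and slow endpoints now share fate on a single in-order connection, so any stall triggered by slow-endpoint traffic delays every stream behind it. The toy model below is only a sketch of that reasoning, not a measurement: the stall windows, request rate, and service time are invented numbers, and `delivered_at`, `fast_latencies`, and `slow_traffic_stalls` are hypothetical names. It compares the fast endpoint's p99 when it shares a connection with the slow endpoint against the p99 when the slow endpoint is moved to its own connection, which is one possible mitigation.

```python
from typing import List, Tuple

# (start_ms, end_ms): a window in which a connection delivers nothing,
# e.g. while waiting on a retransmission or a flow-control window update.
Stall = Tuple[float, float]


def delivered_at(ready_ms: float, stalls: List[Stall]) -> float:
    """Delivery time of a response on an in-order connection with stalls."""
    t = ready_ms
    for start, end in sorted(stalls):
        if start <= t < end:
            t = end  # data behind the gap waits until the stall clears
    return t


def p99(values: List[float]) -> float:
    """Nearest-rank 99th percentile, using integer arithmetic for the rank."""
    ordered = sorted(values)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n)
    return ordered[rank - 1]


def fast_latencies(stalls_on_conn: List[Stall]) -> List[float]:
    """Latencies of 1000 fast requests: one sent per ms, 5 ms of service each."""
    latencies = []
    for i in range(1000):
        sent = float(i)
        ready = sent + 5.0
        latencies.append(delivered_at(ready, stalls_on_conn) - sent)
    return latencies


# Assumption: stalls occur on whichever connection carries slow-endpoint traffic.
slow_traffic_stalls: List[Stall] = [(100.0, 300.0), (600.0, 650.0)]

shared = p99(fast_latencies(slow_traffic_stalls))  # one connection for everything
isolated = p99(fast_latencies([]))                 # slow endpoint on its own connection

print(f"fast-endpoint p99: shared={shared:.0f} ms, isolated={isolated:.0f} ms")
```

Under these made-up inputs the script prints `fast-endpoint p99: shared=195 ms, isolated=5 ms`. The absolute numbers carry no meaning; the point is that the fast endpoint's tail tracks the slow endpoint's stalls only while the two share a connection, which is the correlation to confirm in real per-endpoint latency and retransmit metrics before and after isolating the slow endpoint.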