On-call · Medium · oc-g456

Subject: P99 regression
Level: Mid–Senior
Time: ~35 min
Common in: Reliability & on-call interviews
Industries: Technology, Software development

Question

Every deploy of your service produces a clean, identical pattern: for the first ~2-3 minutes after each pod takes traffic, p99 latency on calls to a downstream database (via PgBouncer) is 5-8x normal, then it settles to baseline and stays there. p50 is only mildly affected. It's not JIT warm-up (the app is Go), and CPU and GC aren't the issue. During the bad window, downstream connection-establishment and TLS-handshake counts are very high, then drop to near zero once latency settles. The connection pool is configured but starts empty on each new pod. How do you triage and fix?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
