Code Room
On-call · Hard
Question
Your Go microservice fleet (40 pods, each with a pgx pool sized at 25) talks to a single Postgres primary with max_connections=500. At 09:30 a downstream payments API started responding in 8s instead of 80ms. Within four minutes your service's p99 latency went from 120ms to 30s, and you're now serving 503s. Postgres shows 500/500 connections used, almost all in 'idle in transaction' state. No deploy went out in the last 12 hours. There were no errors in the DB itself. How do you triage and mitigate, and what's the durable fix?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
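One possible shape for the durable fix is sketched below. It assumes the likely hypothesis that the service opens a transaction and then waits on the payments API, which would explain the 'idle in transaction' connections. The sketch uses only the Go standard library: callPayments, handleOrder, demo, and the 50ms budget are hypothetical names and values, and writeTx stands in for the real pgx transaction code, which is not shown.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// callPayments calls the downstream API with a hard deadline so a slow
// dependency fails fast instead of tying up the caller.
func callPayments(ctx context.Context, client *http.Client, url string, budget time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, budget)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("payments API returned %s", resp.Status)
	}
	return nil
}

// handleOrder makes the remote call first and only then runs writeTx, so no
// database connection is held while waiting on the network.
func handleOrder(ctx context.Context, client *http.Client, url string,
	budget time.Duration, writeTx func(context.Context) error) error {
	if err := callPayments(ctx, client, url, budget); err != nil {
		return fmt.Errorf("payments call failed before any transaction was opened: %w", err)
	}
	return writeTx(ctx)
}

// demo runs handleOrder against a deliberately slow local test server and
// reports whether the transaction step ran.
func demo() (txOpened bool, err error) {
	slow := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(300 * time.Millisecond): // simulated slow dependency
		case <-r.Context().Done(): // client gave up
		}
	}))
	defer slow.Close()

	writeTx := func(ctx context.Context) error {
		txOpened = true // placeholder for the real transactional write
		return nil
	}
	err = handleOrder(context.Background(), slow.Client(), slow.URL, 50*time.Millisecond, writeTx)
	return txOpened, err
}

func main() {
	opened, err := demo()
	fmt.Println("transaction opened:", opened)
	fmt.Println("error:", err)
}
```

In the demo the slow server exceeds the budget, so handleOrder returns an error and the transaction step never runs. On the database side, Postgres's idle_in_transaction_session_timeout setting can act as a backstop for any code path that still holds a transaction open across a slow call.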