On-call · Hard · oc-g002

Subject: Database incidents
Level: Senior–Staff (~40 min)
Common in: Databases & SQL · Reliability & on-call interviews
Industries: Technology, Software development

Question

Your Go microservice fleet (40 pods, each with a pgx pool sized at 25) talks to a single Postgres primary with max_connections=500. At 09:30 a downstream payments API started responding in 8s instead of 80ms. Within four minutes your service's p99 latency went from 120ms to 30s, and you're now serving 503s. Postgres shows 500/500 connections used, almost all in 'idle in transaction' state. No deploy went out in the last 12 hours. There were no errors in the DB itself. How do you triage and mitigate, and what's the durable fix?
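
The numbers in the question already hint at a capacity problem, so it can help to do the arithmetic explicitly. The Go sketch below compares worst-case client demand (pods × pool size) with the server limit and derives a per-pod pool size that would fit. The function name `poolBudget` and the `reserved` headroom of 20 connections (for admin or migration sessions) are illustrative assumptions, not values from the scenario.

```go
package main

import "fmt"

// poolBudget returns the worst-case number of connections the fleet can open
// and the largest per-pod pool size that still fits under the server limit.
// reserved is a hypothetical headroom kept free for admin/maintenance sessions.
func poolBudget(pods, poolSize, maxConns, reserved int) (demand, safePerPod int) {
	demand = pods * poolSize
	if pods > 0 {
		safePerPod = (maxConns - reserved) / pods
	}
	return demand, safePerPod
}

func main() {
	// 40 pods, pool of 25, max_connections=500 come from the question;
	// the reserve of 20 is an assumption for illustration.
	demand, safe := poolBudget(40, 25, 500, 20)
	fmt.Printf("worst-case demand: %d connections against a limit of 500\n", demand)
	fmt.Printf("largest per-pod pool that fits: %d\n", safe)
}
```

With the scenario's figures the fleet can ask for twice as many connections as the primary allows, so any event that makes every pod hold its connections longer (such as a slow dependency) can exhaust the server.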

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
