On-call · Medium · oc-g001

Subject: Database incidents
Level: Mid–Senior · ~30 min
Common in: Databases & SQL · Reliability & on-call · Distributed systems interviews
Industries: Technology, Software development

Question

It's 14:10. Your Rails monolith reads from a pool of three Postgres read replicas behind a load balancer. Support is paging: users report that items they just added to their cart 'disappear' on the next page load, and a few report seeing other people's stale data. The pg_stat_replication view shows replica-2 at 47 seconds of lag and climbing; replica-0 and replica-1 are under 1 second. Grafana shows replica-2's CPU pegged at 100% and disk write IOPS saturated. A bulk-import job that rewrites the products table was kicked off by the merchandising team at 13:55. Walk me through how you triage and mitigate this.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
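
One way to make the "mitigate first" step concrete is to pull the lagging replica out of the read pool before investigating why it lags. The sketch below is a minimal illustration, not a description of how any particular load balancer or Rails setup does it. The names `ReplicaStatus` and `healthy_read_targets`, the 5-second threshold, and the fallback to `"primary"` are all assumptions made for the example. The lag values for replica-0 and replica-1 are illustrative stand-ins for "under 1 second"; only replica-2's 47 seconds comes from the scenario.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaStatus:
    """Hypothetical snapshot of one replica's replay lag, in seconds."""
    name: str
    replay_lag_seconds: float


def healthy_read_targets(replicas, max_lag_seconds=5.0, fallback="primary"):
    """Return the replicas whose lag is within the threshold.

    If no replica qualifies, return only the fallback so reads stay
    correct at the cost of extra load on that node. The 5-second
    default is an assumed value, not a recommendation.
    """
    healthy = [r.name for r in replicas if r.replay_lag_seconds <= max_lag_seconds]
    return healthy or [fallback]


# Snapshot loosely based on the scenario: only replica-2's figure is given.
snapshot = [
    ReplicaStatus("replica-0", 0.4),   # illustrative, "under 1 second"
    ReplicaStatus("replica-1", 0.8),   # illustrative, "under 1 second"
    ReplicaStatus("replica-2", 47.0),  # from the incident description
]

targets = healthy_read_targets(snapshot)
print(targets)
```

A filter like this only addresses the symptom. The replica will keep falling behind until the bulk import is paused, throttled, or allowed to finish, so a strong answer still goes on to root cause and prevention.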
