On-callMediumoc-g611

Subject Replication lagLevel Mid–Senior~35 minCommon in Databases & SQL · Distributed systems interviewsIndustries Technology

Question

It's 14:10 and PagerDuty fires: "read-replica lag > 90s" on your Postgres fleet (1 primary, 3 streaming replicas) backing a social app. Users report stale data — they post a comment, refresh, and it's gone for a minute. The lag dashboard shows replica-2 and replica-3 climbing steadily to 180s while replica-1 is fine. Primary write TPS is up ~40% since a marketing email went out 20 minutes ago. `pg_stat_replication` shows `write_lag` near zero but `replay_lag` ballooning; the replicas' CPU is pinned at 100% on a single `startup`/recovery process. Disk and network on the replicas look healthy. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.