On-callHardoc-g231

Subject Replication lagLevel Senior–Staff~40 minCommon in Databases & SQL · Distributed systems interviewsIndustries Technology, Software development

Question

Postgres logical replication feeds an analytics/replica DB via a single subscription. Lag was near zero, then at 03:00 the replica's apply lag jumped to 40 minutes and keeps growing, even though the primary's write rate is normal and both servers have spare CPU/IO. `pg_stat_replication` shows `write_lag`/`flush_lag` small but `replay_lag` huge; on the subscriber `pg_stat_subscription` shows the apply worker busy. A nightly job on the primary did one giant `UPDATE` touching 80M rows in a single transaction around 03:00. Walk the triage, mitigation, and prevention.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.