Code Room
On-call · Hard
Question
Postgres logical replication feeds an analytics/replica DB via a single subscription. Lag was near zero, then at 03:00 the replica's apply lag jumped to 40 minutes and keeps growing, even though the primary's write rate is normal and both servers have spare CPU/IO. `pg_stat_replication` shows `write_lag`/`flush_lag` small but `replay_lag` huge; on the subscriber, `pg_stat_subscription` shows the apply worker busy. A nightly job on the primary did one giant `UPDATE` touching 80M rows in a single transaction around 03:00. Walk through the triage, mitigation, and prevention.
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
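For the "what prevents a repeat" part, one common option is to break the nightly `UPDATE` into many small transactions so the subscriber's apply worker never has to replay 80M row changes as a single unit. The sketch below is only an illustration under stated assumptions: the table `events`, the columns `id` and `status`, and the helper names `id_batches` and `batched_update_statements` are hypothetical and do not come from the scenario. It also assumes the table has a dense integer key that can be walked in ranges.

```python
from typing import Iterator, List, Tuple


def id_batches(min_id: int, max_id: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (lo, hi) key ranges that cover [min_id, max_id] without overlap."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    lo = min_id
    while lo <= max_id:
        hi = min(lo + batch_size - 1, max_id)
        yield lo, hi
        lo = hi + 1


def batched_update_statements(min_id: int, max_id: int, batch_size: int) -> List[str]:
    """Build one UPDATE per key range; each is meant to run in its own transaction."""
    # Hypothetical table and columns, used only for illustration.
    template = "UPDATE events SET status = 'archived' WHERE id BETWEEN {lo} AND {hi};"
    return [
        template.format(lo=lo, hi=hi)
        for lo, hi in id_batches(min_id, max_id, batch_size)
    ]


if __name__ == "__main__":
    # Small demonstration: ten ids in batches of four.
    for stmt in batched_update_statements(1, 10, 4):
        print(stmt)
```

In use, the job would execute and commit each statement separately, ideally pausing between batches and checking subscriber lag before continuing. The batch size is a tuning assumption, not a recommendation; with 50,000 rows per batch, 80M rows would become 1,600 transactions.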