Question
It's 14:10 and PagerDuty fires: "read-replica lag > 90s" on your Postgres fleet (1 primary, 3 streaming replicas) backing a social app. Users report stale data — they post a comment, refresh, and it's gone for a minute. The lag dashboard shows replica-2 and replica-3 climbing steadily to 180s while replica-1 is fine. Primary write TPS is up ~40% since a marketing email went out 20 minutes ago. `pg_stat_replication` shows `write_lag` near zero but `replay_lag` ballooning; the replicas' CPU is pinned at 100% on a single `startup`/recovery process. Disk and network on the replicas look healthy. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.