Question
It's 14:10. Your Rails monolith reads from a pool of three Postgres read replicas behind a load balancer. Support is paging: users report that items they just added to their cart 'disappear' on the next page load, and a few report seeing other people's stale data. The pg_stat_replication view shows replica-2 at 47 seconds of lag and climbing; replica-0 and replica-1 are under 1 second. Grafana shows replica-2's CPU pegged at 100% and disk write IOPS saturated. A bulk-import job that rewrites the products table was kicked off by the merchandising team at 13:55. Walk me through how you triage and mitigate this.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.