Question
Users report that after saving a profile change, refreshing sometimes shows the OLD value, then it 'fixes itself' a few seconds later. Started 30 minutes ago. Dashboards: write traffic is normal; one of three Postgres read replicas shows replication lag oscillating 3–15s (the other two are <500ms); the load balancer routes reads round-robin across all three; the lagging replica's `WALReceiver` is fine but its disk write latency doubled after a noisy co-tenant landed on that host. No app deploy. How do you triage, stop the bad user experience now, and prevent stale reads going forward?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.