On-callMediumoc-g346

Subject Inconsistent replicasLevel Mid–Senior~35 minCommon in Databases & SQL · Distributed systems interviewsIndustries Technology

Question

Your app writes to a Postgres primary and routes reads to replicas, with a rule: for ~2s after a user writes, route *their* reads to the primary so they see their own change (read-your-writes). After a deploy at 16:00 that moved this routing decision behind a new load balancer, users intermittently report that a setting they just saved 'didn't save' — then it's fine on the next reload. Dashboards: replica lag is normal (~200ms), error rate flat, write success rate 100%. How do you triage, restore read-your-writes, and confirm no writes were actually lost?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.