Code Room
On-callHardoc-g580
Subject Replication lagLevel Senior–Staff~35 minCommon in Databases & SQL · Distributed systems interviewsIndustries Technology

Question

Your read replicas for the orders database are falling behind the primary: replication lag was a steady ~200ms all week and over the last 40 minutes has climbed to 95 seconds and is still rising. Users report seeing 'order not found' immediately after placing an order (reads routed to replicas). The primary's write throughput is up only ~15%, but a batch backfill job that rewrites a 400M-row table in 5k-row chunks started an hour ago. Replica CPU is low; replica disk write IOPS is pegged at the device ceiling. How do you triage and bring lag back down without losing read-after-write correctness?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Diagram & narrate the incident
Loading whiteboard…
Run or narrate your approach, then ask the coach.