On-call · Hard · oc-g561

Subject: On call
Level: Senior–Staff
~45 min
Common in: Databases & SQL · Reliability & on-call · Distributed systems interviews
Industries: Technology

Question

At 07:30 your e-commerce order service starts showing impossible data: some orders have `total_amount` that doesn't match the sum of their line items, and a few hundred orders reference `customer_id`s that don't exist. The app is up. Digging in: a deploy at 06:55 shipped a refactor of the order-write path that, under a specific concurrent-checkout race, wrote orders with partially-applied updates (it split one transaction into two non-atomic writes). New bad rows are STILL being created as traffic flows. Your DB has PITR available, and a read replica is in sync. The bad and good rows are interleaved in time (legit orders are being placed continuously alongside the corrupt ones). Walk through containment, recovery of the corrupted rows, and prevention — note what makes this harder than a single bad batch UPDATE.
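Because good and bad rows are interleaved, a whole-database PITR rewind would also discard legitimate orders, so recovery usually starts by identifying exactly which rows are suspect. The following is a minimal sketch of that scoping step, using an in-memory SQLite database as a stand-in for the real one. Only `total_amount` and `customer_id` come from the scenario; the table names (`orders`, `order_items`, `customers`), the other columns, the calendar date, and the choice to store money as integer cents are assumptions made for illustration.

```python
import sqlite3

# The 06:55 deploy time is from the scenario; the calendar date is hypothetical.
DEPLOY_TS = "2024-01-01 06:55:00"


def find_suspect_orders(conn, since):
    """Return (mismatched_ids, orphaned_ids) for orders created at or after `since`."""
    # Orders whose stored total disagrees with the sum of their line items.
    mismatched = [row[0] for row in conn.execute(
        """
        SELECT o.id
        FROM orders o
        LEFT JOIN order_items i ON i.order_id = o.id
        WHERE o.created_at >= ?
        GROUP BY o.id, o.total_amount
        HAVING o.total_amount <> COALESCE(SUM(i.quantity * i.unit_price), 0)
        ORDER BY o.id
        """, (since,))]
    # Orders that point at a customer row that does not exist.
    orphaned = [row[0] for row in conn.execute(
        """
        SELECT o.id
        FROM orders o
        LEFT JOIN customers c ON c.id = o.customer_id
        WHERE o.created_at >= ? AND c.id IS NULL
        ORDER BY o.id
        """, (since,))]
    return mismatched, orphaned


# Hypothetical schema and sample data; amounts are integer cents.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER,
    total_amount INTEGER,
    created_at TEXT
);
CREATE TABLE order_items (
    order_id INTEGER,
    quantity INTEGER,
    unit_price INTEGER
);
INSERT INTO customers VALUES (1);
INSERT INTO orders VALUES
    (1, 1,  3000, '2024-01-01 06:40:00'),  -- before the deploy, consistent
    (2, 1,  2500, '2024-01-01 07:05:00'),  -- after the deploy, consistent
    (3, 1,  9999, '2024-01-01 07:10:00'),  -- total does not match line items
    (4, 42,  700, '2024-01-01 07:12:00'),  -- customer 42 does not exist
    (5, 1,  1500, '2024-01-01 07:20:00');  -- after the deploy, consistent
INSERT INTO order_items VALUES
    (1, 1, 3000),
    (2, 2, 1000), (2, 1, 500),
    (3, 1, 1000),
    (4, 1, 700),
    (5, 3, 500);
""")

mismatched, orphaned = find_suspect_orders(conn, DEPLOY_TS)
print("total mismatch:", mismatched)
print("missing customer:", orphaned)
```

A query like this only flags candidates. Whether the line items or the stored total is the correct side of a mismatch still has to be established per row, for example by comparing against a PITR restore taken into a separate instance or against payment records.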

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
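For the prevention part, the root cause named in the question is a write path split into two non-atomic writes. One way to illustrate the fix is a sketch in which the order header and its line items commit or roll back together, with database constraints as a second line of defence. It uses SQLite from the Python standard library as a stand-in; `place_order`, the schema, and the `CHECK` constraint are hypothetical and not the service's real code.

```python
import sqlite3


def place_order(conn, customer_id, items):
    """Write the order header and its line items in one transaction.

    `items` is a list of (quantity, unit_price) pairs in integer cents.
    """
    total = sum(quantity * unit_price for quantity, unit_price in items)
    # `with conn` commits if the block succeeds and rolls back if it raises,
    # so a header row can never be persisted without its line items.
    with conn:
        cur = conn.execute(
            "INSERT INTO orders (customer_id, total_amount) VALUES (?, ?)",
            (customer_id, total))
        order_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO order_items (order_id, quantity, unit_price) "
            "VALUES (?, ?, ?)",
            [(order_id, q, p) for q, p in items])
    return order_id


conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    total_amount INTEGER NOT NULL
);
CREATE TABLE order_items (
    order_id INTEGER NOT NULL REFERENCES orders(id),
    quantity INTEGER NOT NULL CHECK (quantity > 0),
    unit_price INTEGER NOT NULL
);
INSERT INTO customers VALUES (1);
""")

# A valid order: both writes commit together.
place_order(conn, 1, [(2, 1000), (1, 500)])

failures = 0
bad_attempts = [
    (1, [(1, 1000), (0, 500)]),  # second line item violates the CHECK
    (42, [(1, 700)]),            # customer 42 does not exist
]
for customer_id, items in bad_attempts:
    try:
        place_order(conn, customer_id, items)
    except sqlite3.IntegrityError:
        failures += 1

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
stored_total = conn.execute("SELECT total_amount FROM orders").fetchone()[0]
print(order_count, stored_total, failures)
```

In the first failing attempt the header insert succeeds before the line item is rejected, and the rollback removes it again, which is the behaviour the split transaction lost. The foreign key would also have rejected the orders that reference nonexistent customers, turning silent corruption into a visible write error.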
