Code Room
System designHard
Question
Design the replica-repair strategy for a distributed time-series/metrics store (N=3 replicas per partition, eventual consistency, tunable quorums) ingesting 2M writes/sec. Over time replicas drift: dropped writes, hinted-handoff replays that never landed, nodes that were down for hours. You need replicas to re-converge without re-streaming terabytes of data on every check, and without a foreground read having to fix everything. How do you detect and repair divergence efficiently?
What a strong answer looks like
Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.