System design | Hard | sd-g643

Subject: Coordination | Level: Senior–Staff | ~45 min | Common in: Distributed systems interviews | Industries: Technology, Software development

Question

Design a cluster membership & failure-detection coordination layer for a 500-node stateful service that needs a consistent-enough view of "who is alive" to drive sharding and request routing, updating within a few seconds of a node death. Constraints: detection must scale (no all-to-all heartbeat storm), transient network blips must not cause mass false-positive evictions (flapping), and concurrent nodes must converge on the membership view. Describe the detection mechanism, how nodes coordinate the shared view, and the false-positive trade-off.
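To put a number on the "heartbeat storm" constraint, here is a rough back-of-envelope sketch. It assumes one one-way heartbeat per ordered node pair per interval for the all-to-all case. For comparison it assumes a randomized probing scheme in which each node sends one ping and receives one ack per period. That scheme is one possible design in the style of SWIM-like protocols, not something the question prescribes. The function names are hypothetical.

```python
def all_to_all_messages(n: int) -> int:
    """Messages per interval if every node heartbeats every other node."""
    return n * (n - 1)


def randomized_probe_messages(n: int) -> int:
    """Messages per period if each node probes one random peer.

    Assumption: one ping plus one ack per node per period, ignoring
    any indirect probes sent after a missed ack.
    """
    return 2 * n


if __name__ == "__main__":
    nodes = 500
    storm = all_to_all_messages(nodes)
    probes = randomized_probe_messages(nodes)
    print(f"all-to-all: {storm} messages per interval")
    print(f"randomized probing: {probes} messages per period")
    print(f"ratio: {storm / probes:.1f}x")
```

At 500 nodes the all-to-all scheme costs 249,500 messages per interval against 1,000 for the probing scheme, which is why per-node load should stay roughly constant as the cluster grows.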

What a strong answer looks like

Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.
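As an illustration of the "data model" and "consistency" parts, the following is a minimal sketch of one possible membership record and merge rule, loosely modeled on the incarnation-number precedence used by SWIM-style gossip protocols. Everything here is an assumption made for illustration: the `Status` and `Entry` types, the rule that a dead entry is never revived (a restarted node would rejoin under a new identity), and the function names.

```python
from dataclasses import dataclass
from enum import IntEnum


class Status(IntEnum):
    ALIVE = 0
    SUSPECT = 1
    DEAD = 2


@dataclass(frozen=True)
class Entry:
    status: Status
    incarnation: int  # bumped by the node itself to refute a suspicion


def _precedence(entry: Entry) -> tuple:
    # DEAD beats everything; otherwise a higher incarnation wins, and at
    # equal incarnation SUSPECT beats ALIVE.
    return (entry.status is Status.DEAD, entry.incarnation, entry.status)


def merge_entry(local: Entry, remote: Entry) -> Entry:
    """Keep whichever entry has higher precedence."""
    return remote if _precedence(remote) > _precedence(local) else local


def merge_view(local: dict[str, Entry], remote: dict[str, Entry]) -> dict[str, Entry]:
    """Merge a gossiped view into the local view, node by node."""
    merged = dict(local)
    for node, entry in remote.items():
        merged[node] = merge_entry(merged[node], entry) if node in merged else entry
    return merged


if __name__ == "__main__":
    a = {"n1": Entry(Status.ALIVE, 3), "n2": Entry(Status.SUSPECT, 1)}
    b = {"n1": Entry(Status.SUSPECT, 3), "n2": Entry(Status.ALIVE, 2)}
    # n1 stays suspected until it refutes with incarnation 4;
    # n2 has already refuted its suspicion with incarnation 2.
    print(merge_view(a, b))
```

Because the merge takes a maximum over a total order, it is commutative, associative, and idempotent, so nodes that exchange views in any order end up with the same result. The suspect state is where the false-positive trade-off lives: a longer suspicion timeout gives a node more time to refute a transient blip, at the cost of slower eviction of a node that really has died.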
