Question
Design a cluster membership & failure-detection coordination layer for a 500-node stateful service. The service needs a consistent-enough view of "who is alive" to drive sharding and request routing, and that view must update within a few seconds of a node death. Constraints: detection must scale (no all-to-all heartbeat storm), transient network blips must not cause mass false-positive evictions (flapping), and nodes updating the view concurrently must converge on the same membership view. Describe the detection mechanism, how nodes coordinate the shared view, and the false-positive trade-off.
Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.