Code Room
System design · Hard · sd-g451
Subject: Coordination · Level: Senior–Staff · ~45 min · Common in: Distributed systems interviews · Industries: Technology

Question

Design cluster membership and failure detection for a fleet of 20,000 stateless service nodes spread across 8 regions, where every node needs a roughly consistent view of which peers are alive (for client-side load balancing and gossip-based config). A central registry receiving heartbeats from all 20,000 nodes is a hotspot and a single point of failure, so you want a decentralized scheme. Address: how membership propagates without O(N^2) traffic; how you detect a dead node quickly without false positives during a brief network blip or a GC pause; and how you keep the membership view from oscillating (a flaky node repeatedly marked dead, then alive), which would thrash every client's load balancing.
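To put rough numbers on the O(N^2) concern, the sketch below compares per-period message counts for the fleet size in the question. It assumes two things the question does not specify: that the naive alternative is all-to-all heartbeating, and that the decentralized scheme is a SWIM-style protocol in which each node probes one random peer per period and receives one ack. The function names are hypothetical, and the round count assumes an idealized epidemic where the informed set doubles every round.

```python
import math


def all_to_all_messages(n: int) -> int:
    """Messages per period if every node heartbeats every other node."""
    return n * (n - 1)


def probe_messages(n: int) -> int:
    """Messages per period if each node sends one ping and gets one ack
    (assumed SWIM-style probing; indirect probes are ignored here)."""
    return 2 * n


def doubling_rounds(n: int) -> int:
    """Gossip rounds for an update to reach n nodes, assuming the set of
    informed nodes doubles each round (an idealized epidemic model)."""
    return math.ceil(math.log2(n))


N = 20_000  # fleet size from the question
print(all_to_all_messages(N))  # 399980000
print(probe_messages(N))       # 40000
print(doubling_rounds(N))      # 15
```

Under these assumptions, per-node load stays constant as the fleet grows, and a membership change reaches everyone in a number of rounds that grows logarithmically with N. Real gossip loses messages and picks duplicate targets, so the round count is a lower bound rather than a guarantee.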

What a strong answer looks like

Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.
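As one illustration of going deep on failure modes, here is a minimal sketch of a single node's local membership view, loosely inspired by SWIM-style suspicion and incarnation numbers. It is one possible direction, not a reference answer. The class and method names, the timeout, and the flap limit are all hypothetical, and the flap counter never decays, which a real design would likely change.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    ALIVE = 1
    SUSPECT = 2
    DEAD = 3


@dataclass
class Member:
    state: State = State.ALIVE
    incarnation: int = 0        # bumped only by the node itself to refute suspicion
    suspect_since: float = 0.0  # local time at which suspicion started
    flaps: int = 0              # DEAD -> ALIVE transitions observed locally


class MembershipView:
    """One node's local view. All thresholds are illustrative assumptions."""

    def __init__(self, suspect_timeout: float = 5.0, max_flaps: int = 3):
        self.suspect_timeout = suspect_timeout
        self.max_flaps = max_flaps
        self.members: dict[str, Member] = {}

    def on_probe_failed(self, node: str, now: float) -> None:
        # A missed probe only raises suspicion; it never kills a node directly.
        m = self.members.setdefault(node, Member())
        if m.state is State.ALIVE:
            m.state, m.suspect_since = State.SUSPECT, now

    def on_alive(self, node: str, incarnation: int) -> None:
        m = self.members.setdefault(node, Member())
        if m.state is State.ALIVE:
            m.incarnation = max(m.incarnation, incarnation)
            return
        # Only a strictly newer incarnation overrides SUSPECT or DEAD, so
        # stale "alive" gossip cannot resurrect a node.
        if incarnation > m.incarnation:
            if m.state is State.DEAD:
                m.flaps += 1
            m.state, m.incarnation = State.ALIVE, incarnation

    def tick(self, now: float) -> None:
        # Promote SUSPECT to DEAD once the suspicion window has elapsed.
        for m in self.members.values():
            if m.state is State.SUSPECT and now - m.suspect_since >= self.suspect_timeout:
                m.state = State.DEAD

    def routable(self) -> list[str]:
        # SUSPECT nodes stay routable so a blip does not move traffic;
        # nodes that have flapped too often are held out (simple damping).
        return sorted(
            name for name, m in self.members.items()
            if m.state is not State.DEAD and m.flaps < self.max_flaps
        )
```

The trade-off to name here is detection latency against stability: a longer suspect_timeout tolerates GC pauses and network blips but delays removal of a truly dead node, and a lower max_flaps protects clients from thrash at the cost of keeping a recovered node out of rotation longer.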
