Code Room
System design · Hard
Question
Design the partition-assignment and rebalancing protocol for a large consumer group on a partitioned event stream. The group has 2,000 partitions and 200 consumer instances, and autoscaling makes consumers join or leave every few minutes. The problem: every join or leave currently triggers a "stop the world" rebalance in which all consumers pause and reshuffle, causing latency spikes and reprocessing. Make rebalances cheap and minimize partition movement while keeping the assignment balanced.
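To make the target concrete: with 2,000 partitions over 200 consumers, each consumer holds 10 partitions, so a single departure should ideally move only about 10 partitions rather than all 2,000. The sketch below is one possible way to compute such a "sticky" assignment, not a reference solution. The function name `sticky_rebalance`, the consumer ids, and the quota rule are all illustrative assumptions; it covers only the assignment computation, not membership detection or the revoke/assign handshake.

```python
def sticky_rebalance(current, members, partitions):
    """Hypothetical sticky assignor (illustrative sketch only).

    current:    dict partition -> consumer id from the previous generation
    members:    iterable of live consumer ids
    partitions: list of all partition ids
    Returns (new_assignment, moved), where moved lists partitions whose
    owner changed and which therefore need a revoke/assign step.
    """
    members = sorted(members)
    base, extra = divmod(len(partitions), len(members))

    # Step 1: every live consumer keeps what it already owns.
    owned = {c: [] for c in members}
    orphans = []
    for p in partitions:
        owner = current.get(p)
        if owner in owned:
            owned[owner].append(p)
        else:
            orphans.append(p)  # owner left, or partition never assigned

    # Step 2: balanced quotas (base or base + 1). The extra slots go to the
    # consumers that already hold the most, so fewer partitions get trimmed.
    by_load = sorted(members, key=lambda c: -len(owned[c]))
    quota = {c: base + (1 if i < extra else 0) for i, c in enumerate(by_load)}

    # Step 3: trim consumers above quota, then fill consumers below quota.
    for c in members:
        while len(owned[c]) > quota[c]:
            orphans.append(owned[c].pop())
    for c in members:
        while len(owned[c]) < quota[c]:
            owned[c].append(orphans.pop())

    new_assignment = {p: c for c, ps in owned.items() for p in ps}
    moved = [p for p in partitions
             if current.get(p) is not None and current[p] != new_assignment[p]]
    return new_assignment, moved


if __name__ == "__main__":
    parts = list(range(2000))
    consumers = [f"c{i}" for i in range(200)]
    assignment, _ = sticky_rebalance({}, consumers, parts)
    # One consumer leaves: only its partitions should change owner.
    assignment, moved = sticky_rebalance(assignment, consumers[1:], parts)
    print(len(moved))
```

In a protocol built on this idea, only the consumers that own a partition in `moved` would need to pause, and only for those partitions; everyone else keeps processing through the rebalance.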
What a strong answer looks like
Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts (data model, bottlenecks, consistency, failure modes) and name the trade-offs you are making.