Code Room
System design · Hard
Question
Design a barrier (fan-out/fan-in) compute coordinator for a batch analytics job. A coordinator splits the work into ~10,000 parallel tasks across a worker fleet, and a downstream aggregation stage must start only after ALL tasks complete (a barrier). Constraints: workers are preemptible and can die, the job must finish despite stragglers, and the barrier must neither release early nor hang forever. Describe how tasks are dispatched and tracked, how completion is detected across concurrent workers, and how stragglers and failures are handled.
What a strong answer looks like
Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.
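As one possible starting point for the tracking and completion-detection part of the question, here is a minimal single-process sketch in Python. It is not a reference answer. Everything in it is an assumption made for illustration: the names `BarrierCoordinator`, `lease`, `complete`, and `barrier_released`, the lease-timeout approach to stragglers, and the `max_attempts` cap are all hypothetical. A real design would keep this state in a durable store with conditional writes and would need task outputs to be idempotent, since two attempts of the same task may both finish.

```python
import time
from dataclasses import dataclass


@dataclass
class Task:
    lease_expiry: float = 0.0  # time after which the task may be re-leased
    attempt: int = 0           # number of times the task has been handed out
    done: bool = False


class BarrierCoordinator:
    """Hypothetical in-memory tracker; assumes a single coordinator thread."""

    def __init__(self, num_tasks, lease_seconds=30.0, max_attempts=5,
                 clock=time.monotonic):
        self.tasks = {i: Task() for i in range(num_tasks)}
        self.remaining = num_tasks      # tasks not yet completed
        self.lease_seconds = lease_seconds
        self.max_attempts = max_attempts
        self.clock = clock
        self.failed = False             # set when a task exhausts its attempts

    def lease(self):
        """Hand out a pending task whose lease is free or expired.

        Returns (task_id, attempt), or None if nothing is leasable right now.
        An expired lease covers both dead workers and stragglers.
        """
        now = self.clock()
        for task_id, task in self.tasks.items():
            if task.done or task.lease_expiry > now:
                continue
            if task.attempt >= self.max_attempts:
                # Fail the job rather than let the barrier wait forever.
                self.failed = True
                continue
            task.attempt += 1
            task.lease_expiry = now + self.lease_seconds
            return task_id, task.attempt
        return None

    def complete(self, task_id):
        """Record completion once; duplicate reports do not change the count."""
        task = self.tasks[task_id]
        if task.done:
            return False
        task.done = True
        self.remaining -= 1
        return True

    def barrier_released(self):
        """The aggregation stage may start only when every task is done."""
        return self.remaining == 0
```

In this sketch the first attempt to report completion wins and later reports for the same task are ignored, which keeps the barrier from releasing early through double counting. The attempt cap is one way to turn a permanent hang into an explicit job failure. Whether to accept results from superseded attempts, and how long a lease should be, are trade-offs a full answer would need to argue for.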