System design · Medium · sd-g197

Subject: Data pipelines · Level: Mid–Senior · ~40 min · Common in: Reliability & on-call interviews · Industries: Technology, Software development

Question

A company runs ~2000 interdependent data jobs as a DAG: a late or failed upstream job cascades into hundreds of downstream tables missing their morning SLA, and on-call wastes hours figuring out which root failure caused the storm of downstream alerts. Jobs have different priorities (a few are board-report-critical, most are not). Design an orchestration system that models dependencies, handles failures and retries sensibly, prioritizes critical paths, and gives clear SLA/root-cause visibility instead of alert storms.

What a strong answer looks like

Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.
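
One of the hard parts named in the question, turning a storm of downstream alerts into a single root-cause view, can be illustrated with a small sketch. This is only one possible approach under stated assumptions, not a reference design: it assumes the orchestrator already marks blocked downstream jobs as unhealthy (failed, late, or skipped), and it treats an unhealthy job as a root when none of its direct upstreams is unhealthy. The job names and the functions `downstream_of` and `group_alerts_by_root` are hypothetical.

```python
from collections import defaultdict, deque


def downstream_of(job, children):
    """Return every job reachable from `job` by following dependency edges (BFS)."""
    seen, queue = set(), deque([job])
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def group_alerts_by_root(upstreams, unhealthy):
    """Collapse unhealthy jobs into one incident per root cause.

    upstreams: dict mapping each job to the list of jobs it depends on.
    unhealthy: set of jobs that failed, are late, or were skipped (assumed input).
    A root is an unhealthy job none of whose direct upstreams is unhealthy.
    """
    # Invert the dependency map so we can walk from a root to its consumers.
    children = defaultdict(list)
    for job, deps in upstreams.items():
        for dep in deps:
            children[dep].append(job)

    roots = [
        job for job in unhealthy
        if not any(dep in unhealthy for dep in upstreams.get(job, ()))
    ]
    # One incident per root, listing only the impacted jobs that are unhealthy.
    return {
        root: sorted(downstream_of(root, children) & unhealthy)
        for root in sorted(roots)
    }


# Hypothetical six-job DAG: one ingestion failure blocks three downstream jobs.
upstreams = {
    "raw_orders": [],
    "clean_orders": ["raw_orders"],
    "revenue_daily": ["clean_orders"],
    "board_report": ["revenue_daily"],
    "raw_clicks": [],
    "click_stats": ["raw_clicks"],
}
unhealthy = {"raw_orders", "clean_orders", "revenue_daily", "board_report"}
incidents = group_alerts_by_root(upstreams, unhealthy)
print(incidents)
```

Here four unhealthy jobs collapse into a single incident keyed by `raw_orders`, so on-call is paged once with the impacted list attached. A fuller answer would also rank incidents by the highest priority among the impacted jobs, so that a root blocking a board-report-critical table pages before one that does not.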
