Question
An executive dashboard and an ML model that pays out daily both depend on a chain of ~30 interdependent jobs across 3 orchestrators (Airflow, a Spark scheduler, and a vendor's sync). The business promises the dashboard is 'fresh by 8am.' Lately it has been late on ~20% of mornings: sometimes because of a single slow upstream job, sometimes a vendor sync that finished but produced empty data, sometimes a retry storm. On-call gets paged at 8am with no idea which of the 30 jobs is the culprit or whether the data is even trustworthy. Design a data-SLA / freshness system that guarantees and observes the 8am promise across heterogeneous orchestrators.
Clarify scale and constraints first. Propose a clean component breakdown, then go deep on the hard parts — data model, bottlenecks, consistency, failure modes — and name the trade-offs you are making.