Question
Your daily batch feature pipeline (Airflow + Spark) materializes aggregate features (e.g., 'avg_session_len_7d') into the online store every morning before the model uses them. Today a 'feature completeness' check fires: ~15% of users have features computed from only partial input data. Dashboards: one upstream source table (clickstream) landed 4 hours late and incomplete today due to an upstream outage, but the feature DAG ran on schedule at its usual time and published partial aggregates anyway, overwriting yesterday's good values. The model is now serving on these thin/partial features. Triage and respond.
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.