Code Room
On-call · Hard · oc-g660
Subject: Feature pipeline label delay
Level: Senior–Staff
~35 min
Common in ML systems · Reliability & on-call interviews
Industries: Technology

Question

Your fraud model retrains nightly on freshly joined labels (a transaction joined to its chargeback/confirmed-fraud outcome). At 08:00 a 'label freshness' alarm fires: the labeled-training table for the last 36 hours is ~90% empty — almost no transactions are getting labels attached — even though the model is still scoring live traffic fine. Dashboards: the raw transactions stream is healthy and the chargeback/outcome events topic is producing normally, but the join job that attaches outcomes to transactions has an output-row count near zero since 20:00 yesterday; its run succeeded (exit 0) and emitted no errors. A schema migration last night renamed the transaction key column from 'txn_id' to 'transaction_id' in the outcomes feed only. How do you triage and respond?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
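One way to make the mitigation and the prevention step concrete is the sketch below. It is a minimal illustration under stated assumptions, not the pipeline's real code: the names `join_labels`, `outcome_key`, `JoinGuardError`, `OUTCOME_KEY_ALIASES`, `MIN_OUTCOME_MATCH_RATE`, the `is_fraud` field, and the 0.5 threshold are all hypothetical. The alias tuple lets the join accept either key name while the migration is sorted out (mitigation), and the two checks turn a silent empty join into a non-zero exit (prevention).

```python
from typing import Iterable, Sequence

# Hypothetical configuration: accept the old and the new key name in the outcomes feed.
OUTCOME_KEY_ALIASES = ("txn_id", "transaction_id")
# Assumed floor for the share of outcomes that find a transaction; tune from history.
MIN_OUTCOME_MATCH_RATE = 0.5


class JoinGuardError(RuntimeError):
    """Raised so the job fails loudly instead of exiting 0 with no rows."""


def outcome_key(record: dict, aliases: Sequence[str] = OUTCOME_KEY_ALIASES) -> str:
    """Return the transaction key from an outcome record under any accepted name."""
    for name in aliases:
        if record.get(name) is not None:
            return record[name]
    # Schema check: a renamed or dropped key column is reported, not ignored.
    raise JoinGuardError(
        f"outcome record has none of the key columns {tuple(aliases)}; saw {sorted(record)}"
    )


def join_labels(
    transactions: Iterable[dict],
    outcomes: Iterable[dict],
    aliases: Sequence[str] = OUTCOME_KEY_ALIASES,
    min_outcome_match_rate: float = MIN_OUTCOME_MATCH_RATE,
) -> list:
    """Attach outcomes to transactions; assumes txn_id is unique per transaction."""
    labels = {outcome_key(o, aliases): o["is_fraud"] for o in outcomes}
    joined = [
        {**t, "is_fraud": labels[t["txn_id"]]}
        for t in transactions
        if t["txn_id"] in labels
    ]
    # Volume check: outcomes exist but almost none matched a transaction.
    if labels:
        match_rate = len(joined) / len(labels)
        if match_rate < min_outcome_match_rate:
            raise JoinGuardError(
                f"only {len(joined)} of {len(labels)} outcomes matched a transaction "
                f"(rate {match_rate:.2f} < {min_outcome_match_rate})"
            )
    return joined


# Usage: the outcomes feed already uses the renamed column, transactions do not.
transactions = [{"txn_id": "t1", "amount": 10.0}, {"txn_id": "t2", "amount": 25.0}]
outcomes = [{"transaction_id": "t2", "is_fraud": True}]
labeled = join_labels(transactions, outcomes)
```

The match rate is measured against outcomes rather than transactions because most transactions never receive a fraud outcome, so a low share of labeled transactions is normal while a low share of matched outcomes is not. In a real job the same two checks would more likely live in the orchestration layer or a data-quality step, and the missed 36 hours would still need a backfill once the join is fixed.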

Diagram & narrate the incident