On-callMediumoc-g565

Subject Feature pipeline stallLevel Mid–Senior~30 minCommon in ML systems interviewsIndustries Technology

Question

You own the fraud-scoring feature pipeline. A 'feature freshness' SLO alarm fires: the 'transactions_last_1h' and 'velocity_5m' features feeding the online model are now 3 hours stale and climbing. The model still scores every request (no errors), but the fraud team reports a quiet uptick in approved chargebacks. Dashboards: the streaming feature job's processed-event lag is climbing linearly since 06:00, its consumer-group offset is stuck, and the upstream Kafka topic's produce rate is normal. A platform team rotated Kafka broker certs at 06:00. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.