Question
You own the fraud-scoring feature pipeline. A 'feature freshness' SLO alarm fires: the 'transactions_last_1h' and 'velocity_5m' features feeding the online model are now 3 hours stale and climbing. The model still scores every request (no errors), but the fraud team reports a quiet uptick in approved chargebacks. Dashboards: the streaming feature job's processed-event lag is climbing linearly since 06:00, its consumer-group offset is stuck, and the upstream Kafka topic's produce rate is normal. A platform team rotated Kafka broker certs at 06:00. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.