Question
At 15:45 your mobile API's error rate jumps to 12% and a specific endpoint, `GET /v2/home-feed`, starts returning 500s for a slice of users. There was NO code deploy in the last 6 hours. The error logs show a NullPointerException deep in a new 'personalized ranking' code path that's normally dormant. Your feature-flag dashboard shows that 8 minutes before the incident, someone bumped the `personalized_ranking_v2` flag from 5% to 50% rollout. The flag's targeting also references a `user.cohort` attribute that's null for users who signed up before a certain date. The errors correlate exactly with users in that older cohort who are now in the 50% bucket. How do you respond and prevent this?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.