On-call · Medium · oc-g558

Subject: On call · Level: Entry–Mid · ~30 min · Common in: Reliability & on-call interviews · Industries: Technology

Question

At 15:45 your mobile API's error rate jumps to 12% and a specific endpoint, `GET /v2/home-feed`, starts returning 500s for a slice of users. There was NO code deploy in the last 6 hours. The error logs show a NullPointerException deep in a new 'personalized ranking' code path that's normally dormant. Your feature-flag dashboard shows that 8 minutes before the incident, someone bumped the `personalized_ranking_v2` flag from 5% to 50% rollout. The flag's targeting also references a `user.cohort` attribute that's null for users who signed up before a certain date. The errors correlate exactly with users in that older cohort who are now in the 50% bucket. How do you respond and prevent this?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
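For this scenario, the "prevents a repeat" part might look roughly like the sketch below. It is written in Python for brevity, although the NullPointerException suggests the real service runs on the JVM. Every name here (`User`, `in_rollout`, `home_feed`, `rank_personalized`, `rank_legacy`, `check_ramp`, and the 20-point ramp limit) is hypothetical and not taken from any real flag SDK. The sketch assumes three guards: a missing `user.cohort` is treated as "not targeted", a failure in the new path degrades to the legacy ranking instead of a 500, and a rollout change larger than an agreed step is rejected.

```python
import hashlib
import logging
from dataclasses import dataclass
from typing import Callable, List, Optional

log = logging.getLogger("home_feed")


@dataclass
class User:
    user_id: str
    cohort: Optional[str] = None  # None for users who signed up before the cutoff


def in_rollout(user_id: str, flag: str, percent: int) -> bool:
    """Deterministically map a user to a bucket in 0-99 for this flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent


def rank_personalized(user: User, items: List[str]) -> List[str]:
    """Stand-in for the new path; it raises on a missing cohort, like the NPE."""
    return sorted(items, key=lambda item: (not item.startswith(user.cohort), item))


def rank_legacy(items: List[str]) -> List[str]:
    return sorted(items)


def home_feed(
    user: User,
    items: List[str],
    percent: int,
    flag: str = "personalized_ranking_v2",
    personalized: Callable[[User, List[str]], List[str]] = rank_personalized,
) -> List[str]:
    # Guard 1: a user with no cohort is never sent down the new path.
    if user.cohort is None or not in_rollout(user.user_id, flag, percent):
        return rank_legacy(items)
    # Guard 2: any failure in the new path falls back to the legacy ranking.
    try:
        return personalized(user, items)
    except Exception:
        log.exception("personalized ranking failed; serving legacy feed")
        return rank_legacy(items)


def check_ramp(old_percent: int, new_percent: int, max_jump: int = 20) -> bool:
    """Assumed policy: allow a rollout increase only up to max_jump points."""
    return new_percent - old_percent <= max_jump
```

Under this assumed policy, `check_ramp(5, 50)` returns `False`, so the jump in the question would have needed an explicit override. The immediate mitigation is still the flag itself: setting the rollout back to 5% (or 0%) stops the errors without a deploy, and the guards above are follow-up work.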

