On-callMediumoc-g407

Subject Cpu saturationLevel Mid–Senior~30 minCommon in Reliability & on-call interviewsIndustries Technology, Software development

Question

A Python (FastAPI/uvicorn) reporting endpoint's CPU climbs from 30% to 95% and p99 goes from 200ms to 9s starting at 10:05, while request rate is flat and DB query time is unchanged. A py-spy flame graph shows ~85% of on-CPU time in JSON serialization and a nested list-membership check (`if x in big_list`) inside a per-row loop. The only change this morning was a 'cosmetic' update that made each report include the full list of all related entities per row instead of just IDs — average rows-per-report didn't change, but the *per-row* payload and the related-entity list got much bigger. Triage and mitigate.

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.