On-call · Medium · oc-g186

Subject: Memory leak · Level: Mid–Senior · ~35 min · Common in: Reliability & on-call interviews · Industries: Software development

Question

A Python (Gunicorn + Flask) recommendation API has workers that grow from 200 MB to 2 GB RSS over ~6 hours and get killed by the orchestrator, dropping in-flight requests. Each worker handles all routes. Dashboards show RSS climbing monotonically; `tracemalloc` top-stats taken an hour apart attribute the growth to a module-level dict used to memoize per-user feature vectors with a hand-rolled `@cache` decorator that has no eviction. Request volume is steady and the set of active users is bounded, but user IDs include a high-cardinality experiment bucket suffix appended last week. Triage and fix.
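The hour-apart `tracemalloc` comparison mentioned in the question can be reproduced roughly as follows. This is a minimal sketch, not the service's actual tooling: `_feature_cache`, `handle_requests`, `top_growth`, and `demo` are hypothetical names, and the "hour" is simulated by a second batch of requests with keys that never repeat.

```python
import tracemalloc

# Hypothetical stand-in for the unbounded module-level memo dict.
_feature_cache = {}


def handle_requests(start, count):
    """Simulate requests whose cache keys never repeat."""
    for i in range(start, start + count):
        _feature_cache[f"user-{i}"] = [0.0] * 64  # placeholder feature vector


def top_growth(old, new, limit=5):
    """Return the allocation sites that grew the most between two snapshots."""
    # compare_to() sorts by absolute size difference, largest first.
    return new.compare_to(old, "lineno")[:limit]


def demo():
    tracemalloc.start()
    try:
        handle_requests(0, 1_000)
        before = tracemalloc.take_snapshot()
        handle_requests(1_000, 5_000)  # stands in for "an hour later"
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    return top_growth(before, after)


if __name__ == "__main__":
    for stat in demo():
        print(stat)  # file:line, size, size delta, count, count delta
```

In a real worker, tracing would have to be enabled at startup (for example with the `PYTHONTRACEMALLOC` environment variable), and it adds overhead, so it is usually enabled on one worker or one canary rather than fleet-wide.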

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
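For the root-cause part of this scenario, one possible fix is to bound the cache and to key it on the stable user ID rather than the suffixed one. The sketch below rests on assumptions the question does not state: that the experiment bucket follows a `:` separator (for example `u42:exp-7`), and that feature vectors do not depend on the bucket. `bounded_cache`, `strip_bucket`, and `feature_vector` are hypothetical names.

```python
import threading
from collections import OrderedDict
from functools import wraps


def bounded_cache(maxsize, key=lambda arg: arg):
    """Memoize a one-argument function with LRU eviction.

    `key` normalizes the argument; the wrapped function receives the
    normalized value, so equivalent IDs share one cache entry.
    """
    def decorator(fn):
        entries = OrderedDict()
        lock = threading.Lock()  # protects `entries` under threaded workers

        @wraps(fn)
        def wrapper(arg):
            k = key(arg)
            with lock:
                if k in entries:
                    entries.move_to_end(k)  # mark as most recently used
                    return entries[k]
            value = fn(k)  # compute outside the lock
            with lock:
                entries[k] = value
                entries.move_to_end(k)
                while len(entries) > maxsize:
                    entries.popitem(last=False)  # evict least recently used
            return value

        wrapper.cache_len = lambda: len(entries)  # handy for a size metric
        return wrapper
    return decorator


def strip_bucket(user_id):
    # ASSUMPTION: the bucket suffix follows a ":"; adjust to the real format.
    return user_id.split(":", 1)[0]


@bounded_cache(maxsize=50_000, key=strip_bucket)
def feature_vector(user_id):
    # Placeholder for the real, expensive feature lookup.
    return [float(len(user_id))]
```

If callers can normalize the ID themselves, the standard library's `functools.lru_cache(maxsize=...)` gives the same bound with less code. As a stopgap while the fix ships, Gunicorn's `max_requests` and `max_requests_jitter` settings recycle workers after a set number of requests, which replaces the orchestrator's hard kill with a restart between requests.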

