On-callMediumoc-g665

Subject Model serving registry outageLevel Mid–Senior~30 minCommon in ML systems · Reliability & on-call · Code quality & review interviewsIndustries Technology

Question

Your model-serving platform pulls model artifacts and metadata from a central model registry / artifact store at pod startup. At 14:00 a new model rollout stalls: freshly scheduled pods are stuck in a crash-loop because they can't fetch the artifact — and worse, an unrelated autoscaling event is now spinning up new pods for an already-deployed model, and THOSE are also failing to start. Already-running pods (which loaded their artifact earlier) keep serving fine. Dashboards: the registry/artifact-store API is returning 500s and timeouts; its own status shows a backing-storage degradation. Error rate on live serving is still low because warm pods are unaffected — but capacity can't grow and no deploy can proceed. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.

Learn the concepts

Diagram & narrate the incident

Loading whiteboard…

Run or narrate your approach, then ask the coach.