Question
Your model-serving platform pulls model artifacts and metadata from a central model registry / artifact store at pod startup. At 14:00 a new model rollout stalls: freshly scheduled pods are stuck in a crash-loop because they can't fetch the artifact — and worse, an unrelated autoscaling event is now spinning up new pods for an already-deployed model, and THOSE are also failing to start. Already-running pods (which loaded their artifact earlier) keep serving fine. Dashboards: the registry/artifact-store API is returning 500s and timeouts; its own status shows a backing-storage degradation. Error rate on live serving is still low because warm pods are unaffected — but capacity can't grow and no deploy can proceed. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.