Question
Your image-captioning inference service runs on A10 GPUs behind a queue. At 09:20 the pod restart rate spikes: replicas are CUDA-OOM-killing and getting OOMKilled by the GPU memory limit, then the queue backs up while they reload the model. Dashboards show requests per second is normal but the p95 input token/image-size distribution shifted right an hour ago, GPU memory now sawtooths up to the limit before each crash, and successful-request rate is dropping. No code deploy in two days; a partner started sending high-resolution images this morning. How do you triage and stabilize?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.