Question
A brief 90-second traffic spike pushed your service past capacity. The spike ended minutes ago and traffic is back to normal levels, but the service has not recovered: it is stuck at a ~70% error rate with high latency. Dashboards show CPU pinned at 100%, the request queue full, clients timing out and retrying aggressively (each failed call is retried 3 times), and goodput near zero even though the original, non-retry arrival rate would fit comfortably under capacity. Restarting a few instances helps briefly, then they re-saturate. There was no deploy. How do you triage, and how do you break the service out of this state?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
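
To see why the service can stay saturated after the spike has passed, it may help to work through the retry arithmetic. The sketch below assumes each attempt fails independently with the same probability, treats the ~70% error rate as that per-attempt failure rate, and reads "retries 3x" as up to 3 retries per original call; all three are simplifying assumptions. The function names and the example figures (60 requests/s of original load against a capacity of 100 requests/s) are hypothetical and chosen only for illustration.

```python
def retry_amplification(p_fail: float, max_retries: int) -> float:
    """Expected attempts per original request.

    Attempt k (k = 0 is the first try) only happens if the previous k
    attempts all failed, so the expected count is 1 + p + p^2 + ... + p^max_retries.
    """
    return sum(p_fail ** k for k in range(max_retries + 1))


def offered_load(original_rate: float, p_fail: float, max_retries: int) -> float:
    """Total request rate the service sees once retries are included."""
    return original_rate * retry_amplification(p_fail, max_retries)


if __name__ == "__main__":
    # Hypothetical numbers: capacity and original rate are assumptions.
    capacity = 100.0
    original = 60.0
    factor = retry_amplification(0.7, 3)
    total = offered_load(original, 0.7, 3)
    print(f"amplification factor: {factor:.3f}")
    print(f"offered load: {total:.1f} req/s vs capacity {capacity:.1f} req/s")
```

Under these assumptions every original request turns into about 2.53 attempts, so an original load at 60% of capacity arrives as roughly 152% of capacity. That is one plausible reading of why restarts only help briefly: the retry traffic itself keeps the failure rate high, and the high failure rate keeps generating retries. Anything that lowers the amplification factor (capping or disabling retries, shedding load at the edge) moves the offered load back toward the original rate.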