Question
Your AI assistant feature calls a hosted LLM API. At 10:30, responses start failing or stalling for many users. The dashboards show that the LLM provider is returning HTTP 429 with 'tokens per minute (TPM) limit exceeded', along with some 'requests per minute exceeded' errors; that your queue of pending generations is growing; and that token spend is already at 3x the daily average, even though it is only mid-morning. Recent context: yesterday you shipped a feature that auto-summarizes the user's entire document on page load, and you also raised max_output_tokens to allow longer answers. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
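As one illustration of the "stop the bleeding" step, here is a minimal sketch of a client-side admission gate. It assumes the provider counts prompt tokens plus requested output tokens against the TPM limit, and that the quickest mitigation is to switch off the new auto-summarize path and cap output length again. The names TokenBudget, Mitigation, and admit, and the numbers 512 and 1000, are hypothetical and chosen only for the example; real values would come from your provider quota and your pre-release settings.

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TokenBudget:
    """Sliding 60-second token budget, kept below the provider's TPM limit."""
    tokens_per_minute: int
    _events: List[Tuple[float, int]] = field(default_factory=list)

    def try_reserve(self, tokens: int, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Forget reservations that have aged out of the one-minute window.
        self._events = [(t, n) for t, n in self._events if now - t < 60.0]
        used = sum(n for _, n in self._events)
        if used + tokens > self.tokens_per_minute:
            return False  # admitting this request would exceed the budget
        self._events.append((now, tokens))
        return True


@dataclass
class Mitigation:
    """Hypothetical runtime flags flipped during the incident."""
    auto_summarize_enabled: bool = False  # kill switch for yesterday's feature
    max_output_tokens: int = 512          # assumed rolled-back output cap


def admit(kind: str, prompt_tokens: int, requested_output_tokens: int,
          cfg: Mitigation, budget: TokenBudget,
          now: Optional[float] = None) -> Optional[int]:
    """Return the output-token cap to send, or None to shed the request."""
    if kind == "auto_summary" and not cfg.auto_summarize_enabled:
        return None  # shed background work before user-initiated requests
    capped = min(requested_output_tokens, cfg.max_output_tokens)
    if not budget.try_reserve(prompt_tokens + capped, now=now):
        return None  # fail fast instead of queueing into more 429s
    return capped
```

With the flags above, a page-load summary is dropped without touching the budget, while an interactive request is admitted with its output capped. Shedding locally keeps the pending queue from growing while you confirm which of the two changes is driving the token spend.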