Question
Your app calls a vendor LLM API. The vendor enforces both a requests-per-minute (RPM) limit and a max-concurrent-requests limit. At 15:00 you start getting intermittent 429s with the message 'concurrency limit exceeded', not the RPM error. Your request rate is comfortably under the documented RPM cap. Dashboards: average in-flight concurrency to the vendor looks fine, but there are brief concurrency spikes whenever a few long-running (high-token) generations overlap with normal short ones. Traffic has not changed; yesterday you shipped a feature that allows much longer outputs. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
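One possible "stop the bleeding" step is a client-side cap on in-flight vendor calls, so that overlapping long generations queue locally instead of tripping the vendor's concurrency limit. The sketch below is a minimal illustration, not a definitive implementation: `ConcurrencyGate`, `fake_vendor_call`, and the limit of 4 are hypothetical names and values, and the real cap should sit somewhat below whatever the vendor documents.

```python
import asyncio


class ConcurrencyGate:
    """Caps in-flight calls at `limit` and records the observed peak.

    Hypothetical helper for illustration; not part of any vendor SDK.
    """

    def __init__(self, limit: int) -> None:
        self._sem = asyncio.Semaphore(limit)
        self.in_flight = 0
        self.peak = 0

    async def run(self, call):
        # Callers beyond the cap wait here instead of hitting the vendor.
        async with self._sem:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                return await call()
            finally:
                self.in_flight -= 1


async def fake_vendor_call(seconds: float) -> float:
    # Stand-in for the real vendor request (assumption: latency grows
    # with output length, so long generations hold a slot longer).
    await asyncio.sleep(seconds)
    return seconds


async def demo(limit: int = 4) -> int:
    gate = ConcurrencyGate(limit)
    # A few long generations mixed with many short ones.
    durations = [0.05] * 3 + [0.01] * 20
    await asyncio.gather(
        *(gate.run(lambda d=d: fake_vendor_call(d)) for d in durations)
    )
    return gate.peak


if __name__ == "__main__":
    print("peak in-flight:", asyncio.run(demo(limit=4)))
```

This gate only bounds concurrency within one process. If the app runs several replicas, the vendor limit has to be divided among them or enforced through a shared limiter. Exporting the `peak` value as a metric also gives a direct signal for the triage step, since the spikes are invisible in an average.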