On-call · Medium · oc-g081

Subject: API quota · Level: Mid–Senior · ~35 min · Common in: Reliability & on-call interviews · Industries: Technology

Question

Your AI assistant feature calls a hosted LLM API. At 10:30, responses start failing or stalling for many users. The dashboards show three things: the LLM provider is returning HTTP 429, mostly with 'tokens per minute (TPM) limit exceeded' and some with 'requests per minute exceeded'; your queue of pending generations is growing; and token spend on the cost dashboard is already at 3x the daily average, even though it is only mid-morning. Recent context: yesterday you shipped a feature that auto-summarizes the user's entire document on page load, and you also raised max_output_tokens to allow longer answers. How do you triage and mitigate?

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
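One way to make "stop the bleeding" concrete is a client-side admission gate placed in front of the LLM call. The sketch below is illustrative only: `TokenBudget`, `AdmissionGate`, the feature names `auto_summarize` and `chat`, and all numeric limits are hypothetical and do not come from any provider SDK. It assumes you can estimate prompt tokens before sending, and it budgets for the worst case by reserving the full output cap for each request. It combines three mitigations that fit this scenario: a kill switch for the newly shipped feature, a temporary cap on output tokens, and a token bucket that keeps your own traffic under the TPM limit so requests are shed locally instead of piling up as 429s.

```python
import time
from typing import Callable, Optional, Set


class TokenBudget:
    """Token bucket sized in LLM tokens per minute (hypothetical helper)."""

    def __init__(self, tpm_limit: int, clock: Callable[[], float] = time.monotonic):
        self.capacity = float(tpm_limit)
        self.refill_per_sec = tpm_limit / 60.0
        self.available = float(tpm_limit)
        self.clock = clock
        self.last = clock()

    def try_spend(self, tokens: int) -> bool:
        # Refill in proportion to elapsed time, never above capacity.
        now = self.clock()
        elapsed = now - self.last
        self.available = min(self.capacity, self.available + elapsed * self.refill_per_sec)
        self.last = now
        if tokens > self.available:
            return False
        self.available -= tokens
        return True


class AdmissionGate:
    """Decides whether a generation may be sent, and with what output cap."""

    def __init__(self, budget: TokenBudget, disabled_features: Set[str], output_cap: int):
        self.budget = budget
        self.disabled_features = disabled_features  # kill switch, e.g. fed by a feature flag
        self.output_cap = output_cap                # temporary incident-time cap

    def admit(self, feature: str, prompt_tokens: int, max_output_tokens: int) -> Optional[int]:
        """Return the output-token cap to send, or None if the request is shed."""
        if feature in self.disabled_features:
            return None
        capped = min(max_output_tokens, self.output_cap)
        # Assumption: reserve prompt + full output cap as the worst-case cost.
        if not self.budget.try_spend(prompt_tokens + capped):
            return None
        return capped
```

During the incident you would add `auto_summarize` to the disabled set first, since it is the most recent change and the least user-critical path, then lower `output_cap` if interactive traffic alone still exceeds the limit. A shed request should fail fast with a clear message to the user rather than join the growing queue.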
