Code Room
On-callHard
Question
Your service calls a cloud vendor API with a per-minute quota you normally use ~30% of. At 11:40 you start getting 429 'quota exceeded' from the vendor and the affected feature errors at ~50%. Dashboards: your OUTBOUND call rate to the vendor jumped 4x starting at 11:38, but inbound user traffic is flat. Around 11:37 the vendor had a 90-second blip of 503s (now recovered). Your client retries failed vendor calls up to 5 times with a fixed 200ms backoff. No deploy. How do you triage and mitigate?
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
Learn the concepts
Loading whiteboard…
Run or narrate your approach, then ask the coach.