Code Room
On-call · Medium
Question
Your service calls a vendor API that rate-limits per account, not per key. Starting at 10:15, your latency-sensitive realtime calls to the vendor begin getting intermittent 429s, about 1 in 6, and customers feel them as occasional slow page loads. On average, your overall call rate to the vendor is well under the documented per-second limit. The dashboards show the 429s arriving in tight bursts every few minutes, perfectly correlated with your nightly-style bulk export job, which a teammate rescheduled this morning to run every 5 minutes. How do you triage and mitigate?
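To make the "under the limit on average" detail concrete, here is a small sketch of how a short burst can exceed a per-second limit while the window average stays low. Every number in it (the limit, the realtime rate, the export burst size) is a hypothetical chosen for illustration, not a figure from the scenario.

```python
from collections import Counter


def per_second_counts(timestamps):
    """Count calls falling into each one-second bucket."""
    return Counter(int(t) for t in timestamps)


# Hypothetical numbers, not taken from the incident:
# a 10 calls/s account limit, 4 calls/s of steady realtime traffic,
# and an export that fires 40 calls in the first 2 seconds of a 300 s window.
LIMIT_PER_SECOND = 10
WINDOW_SECONDS = 300

realtime = [s + i / 4 for s in range(WINDOW_SECONDS) for i in range(4)]
export = [s + i / 20 for s in range(2) for i in range(20)]

counts = per_second_counts(realtime + export)
average = sum(counts.values()) / WINDOW_SECONDS
peak = max(counts.values())
over_limit = sorted(s for s, n in counts.items() if n > LIMIT_PER_SECOND)

print(f"average={average:.2f}/s peak={peak}/s seconds over limit={over_limit}")
```

Because the limit is per account, the realtime calls and the export draw from the same budget, so the realtime path can be rejected during the burst even though it never changed its own rate.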
What a strong answer looks like
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
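One possible longer-term guard, after the immediate mitigation (for example pausing the export or reverting its schedule), is to throttle the bulk job on the client side so it leaves headroom for realtime calls. The following is a minimal token-bucket sketch under that assumption; `TokenBucket`, `run_export`, and the `send` callback are hypothetical names, not part of any vendor SDK, and the rate would need to be picked from the vendor's documented limit.

```python
import time


class TokenBucket:
    """Client-side limiter: about `rate` calls/s, bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic, sleep=time.sleep):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock  # injectable so the limiter can be tested
        self.sleep = sleep
        self.last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last check, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def acquire(self):
        """Block until one token is available, then consume it."""
        while True:
            self._refill()
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Sleep just long enough for the missing fraction of a token to accrue.
            self.sleep((1 - self.tokens) / self.rate)


def run_export(items, send, bucket):
    """Send each export item only after the bucket grants a token."""
    for item in items:
        bucket.acquire()
        send(item)
```

A call such as `run_export(rows, send_to_vendor, TokenBucket(rate=2, capacity=2))` would spread the export over time instead of bursting; the names `rows` and `send_to_vendor` are placeholders. This limits only the bulk path, so realtime calls keep their full latency budget.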