On-call · Medium · oc-g321

Subject: Throttling · Level: Mid–Senior · ~24 min · Common in Reliability & on-call interviews · Industries: Technology

Question

After a config rollout to your API gateway fleet, legitimate users start hitting 429 `Too Many Requests` even at normal, low traffic levels. The intended global rate limit for a tenant is 10,000 rps, and the gateway runs 20 instances. The new config set a per-instance limit of 500 rps. Dashboards show each instance independently rejecting requests while the tenant's aggregate rate is well below the global limit. The aggregate 429 rate jumped right after the config push, and traffic itself is flat versus yesterday. How do you triage and fix this, and what is the durable design change?
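The arithmetic behind the scenario is worth checking: 20 instances × 500 rps equals the 10,000 rps global limit only if the tenant's traffic is spread perfectly evenly. The sketch below is one plausible reading, not the confirmed root cause. It assumes the load balancer sends the tenant's traffic unevenly across instances. The 4,000 rps tenant rate, the 600/100 rps split, and the helper `admitted_rps` are hypothetical values chosen for illustration.

```python
# Minimal model: each gateway instance enforces its own limit and knows
# nothing about what the other instances are seeing.

INSTANCES = 20
PER_INSTANCE_LIMIT = 500   # rps, from the new config
GLOBAL_LIMIT = 10_000      # rps, the intended tenant-wide limit


def admitted_rps(offered_per_instance, per_instance_limit):
    """Total rps admitted when every instance caps its own share independently."""
    return sum(min(offered, per_instance_limit) for offered in offered_per_instance)


# Hypothetical tenant traffic of 4,000 rps, well under the global limit.
even_load = [200] * INSTANCES            # perfectly balanced
skewed_load = [600] * 4 + [100] * 16     # assumed skew: 4 hot instances

for name, load in (("even", even_load), ("skewed", skewed_load)):
    offered = sum(load)
    admitted = admitted_rps(load, PER_INSTANCE_LIMIT)
    print(f"{name}: offered={offered} admitted={admitted} rejected={offered - admitted}")
```

Under these assumed numbers the balanced fleet admits all 4,000 rps, while the skewed fleet rejects 400 rps on the four hot instances even though the tenant is at 40% of the intended global limit. Whether skew is actually present is something to confirm from per-instance request-rate dashboards during triage.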

What a strong answer looks like

Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.
