Question
Your CI/CD platform integrates with the GitHub API to fetch repos, check statuses, and post comments. At 14:00 builds start stalling and PR status checks stop updating. Dashboards: GitHub API responses shift to HTTP 403 with 'API rate limit exceeded' and 'You have exceeded a secondary rate limit'; the X-RateLimit-Remaining header is at 0; your fleet shares a single GitHub App installation token across all customers' builds. Recent context: you onboarded several large enterprise customers this week, multiplying concurrent builds; nothing else changed. How do you triage and mitigate?
Stop the bleeding first (mitigate), then form hypotheses from real signals. Separate root cause from symptom, communicate status as you go, and close with what prevents a repeat.