Incident response & triage

Stop the bleeding first. Figure out why it happened later.

The idea

When the site goes down (an Incident), the priority is not finding the root cause. The priority is Mitigation, measured as Time To Mitigate (TTM): restoring service for users as fast as possible. If the database is crashing because of a bad new feature, you don't debug the code; you immediately roll back to the previous version.

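Only after the site is stable do you take the time to investigate the Root Cause and ship a proper fix, measured as Time To Resolve (TTR). Finally, you write a Blameless Postmortem that asks how the system, rather than any individual, let the failure through, so that the same failure does not recur.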
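
To make the two metrics concrete, here is a minimal sketch of how they might be computed from incident timestamps. The `Incident` class, its field names, and the sample times are hypothetical, and measuring both TTM and TTR from the moment the alert fires is an assumption; teams define these clocks differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical record of the three timestamps that matter in triage."""
    alert_fired: datetime  # the page goes out
    mitigated: datetime    # users are served again (e.g. rollback finished)
    resolved: datetime     # the proper fix is deployed

    def ttm(self) -> timedelta:
        # Time To Mitigate: alert until service is restored.
        return self.mitigated - self.alert_fired

    def ttr(self) -> timedelta:
        # Time To Resolve: alert until the root cause is fixed
        # (assumption: measured from the alert, not from mitigation).
        return self.resolved - self.alert_fired


# Made-up example: mitigated in 8 minutes, resolved 3.5 hours after the alert.
incident = Incident(
    alert_fired=datetime(2024, 1, 1, 12, 0),
    mitigated=datetime(2024, 1, 1, 12, 8),
    resolved=datetime(2024, 1, 1, 15, 30),
)
print(incident.ttm())  # 0:08:00
print(incident.ttr())  # 3:30:00
```

In this example users were affected for only 8 minutes even though the full fix took hours, which is why TTM is the number to minimize first.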

[Interactive timeline: Alert Fires (T+0 min; PagerDuty alert: 500 errors spiking, site is down) -> Mitigation (TTM) -> Root Cause (TTR)]

How it works (The Triage Path)

1. Alert & Assess
   - Acknowledge the page. Check dashboards. Classify the impact, e.g. SEV-1 (site down).

2. Mitigate (Stop the bleeding)
   - Do NOT read code. Do NOT deploy fixes.
   - Action: "Revert the last deployment" or "Turn off the feature flag."
   - Communicate: "Site is recovering. We are monitoring."

3. Root Cause (Resolve)
   - Now that service is restored and the clock is no longer ticking for users, take the time you need (hours, if necessary) to debug the actual bug.
   - Deploy the proper fix.

4. Blameless Postmortem
   - Why did the system allow this bug to reach production?
   - Action item: Add an automated test for this specific edge case.
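
Step 2 of the triage path can be sketched as a small decision helper. This is an illustration only: `choose_mitigation` and its parameters are hypothetical, and both the preference for a feature flag over a revert and the escalation fallback are assumptions not stated above.

```python
from typing import Optional


def choose_mitigation(flag_name: Optional[str], deployed_recently: bool) -> str:
    """Pick the fastest way to restore service without debugging code.

    flag_name: name of the feature flag guarding the suspect change,
               or None if the change is not behind a flag (hypothetical input).
    deployed_recently: whether a deployment went out shortly before the alert.
    """
    if flag_name is not None:
        # Assumption: flipping a flag is usually quicker than a redeploy.
        return f"Turn off the feature flag: {flag_name}"
    if deployed_recently:
        return "Revert the last deployment"
    # Assumption: with no quick lever available, bring in more responders.
    return "Escalate: no quick mitigation available"


print(choose_mitigation("new_checkout", True))  # Turn off the feature flag: new_checkout
print(choose_mitigation(None, True))            # Revert the last deployment
```

Note that none of the branches reads or patches code; each one only undoes or disables a recent change.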