Zero-to-One: Standing Up Your Incident Management Program in 15 Minutes
Are you part of a small company that knows it should have an incident management program but keeps putting it off because it seems overwhelming? This post is for you. It shows you how to set up an effective incident management program in 15 minutes. Follow these steps, stay focused, and you'll be ready to handle incidents confidently.
Step 0: Required Tools
To move quickly, I make a few assumptions about tooling in this post. You can swap in your tools of choice, but from here on out, I’ll be referencing the following:
Slack for team communication.
Google Workspace for spreadsheets and documentation.
A Slack alias for your engineering team: @engineering-team
Step 1: Set Up Communication Channels
Instructions:
Create a Slack channel named #incidents.
Step 2: Copy and paste my Google Doc starter folder for postmortems
Instructions:
Go to this Google folder and make a copy in your workspace.
Share the folder with your organization.
Add it as a bookmark in your new #incidents channel.
Step 3: Review the spreadsheet and postmortem template
Instructions:
Open the Incident Tracker spreadsheet. The goal of this spreadsheet is to ensure that postmortem action items are completed.
Open up the example incident titled “2024-04-11 Temporarily Lost All User Data.” This will give you an idea of what a lightweight incident postmortem could look like.
Step 4: Go Live
Instructions:
While most instructions suggest running a mock incident, I think you’ll be OK as long as you’re sure people know how to page you. Beyond that, you can learn as you go.
Send the following two messages: one to your entire organization and one to your engineering team.
Message to General Team
Invite your entire org to the #incidents channel and send them the following:
Hi Team,
The goal of this message is to give everyone a quick primer on our incident management process. First, some definitions:
- **Incident**: Any event that disrupts or could disrupt service. When in doubt, assume it's an incident.
- **Major Incident**: A disruption affecting more than 50% of users.
For reporting and managing incidents, we will use this channel.
- For Major Incidents, please call my cell phone at 555-555-5555. If I don't pick up, please call Sarah's cell phone at 555-555-5555. When in doubt, please call. We can calibrate over time.
- For Non-Major Incidents, please tag @engineering-team in this channel and we will triage the incident as quickly as possible.
Again, when in doubt, assume it's an incident.
Note: This message assumes that you don’t already have something like PagerDuty set up. If you do, swap out the cell phone instructions for instructions on how to trigger the pager.
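If you later want to automate triage, the routing rule in the message above can be written down as a small decision function. This is only a sketch: the names `is_major` and `route_incident`, the `affected_fraction` input, and the returned strings are illustrative assumptions, not part of Slack or any paging tool.

```python
from typing import Optional

# "More than 50% of users", taken from the Major Incident definition above.
MAJOR_THRESHOLD = 0.5


def is_major(affected_fraction: float) -> bool:
    """Return True when strictly more than half of users are affected."""
    return affected_fraction > MAJOR_THRESHOLD


def route_incident(affected_fraction: Optional[float]) -> str:
    """Pick the reporting path for an incident.

    affected_fraction is a hypothetical estimate between 0.0 and 1.0.
    Pass None when the impact is unknown; per "when in doubt, please call",
    an unknown impact is treated like a Major Incident.
    """
    if affected_fraction is None or is_major(affected_fraction):
        return "call"
    return "tag @engineering-team in #incidents"
```

Treating an unknown impact as a reason to call mirrors the "when in doubt" guidance; you can loosen it as you calibrate over time.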
Message to Engineering Team
Send the following message to your engineering team Slack channel:
Hi Engineering Team,
We are implementing a new incident management system to streamline our response process. Here’s how we will manage incidents:
**Process**:
1. **Responding To An Incident**: Incidents don't happen often, so let's be greedy with involving people. When an incident is identified, we should have at least two people pairing on resolution. Decide up front who will take the majority of the actions (Responder) and who will keep the rest of the organization informed (Incident Commander / Scribe).
2. **Live Updates**: Use the #incidents Slack channel for live updates during an incident.
- For now, please be sure to provide updates at least every 15 minutes.
3. **Postmortem**: After resolving an incident, use the following template for postmortem analysis. Feel free to have the postmortem right away or schedule 30 minutes in the coming days: [Insert Postmortem Google Doc Template Link]
Please take 15 minutes to review these materials and be prepared to follow this process during incidents. Feel free to comment on this message for any clarifications.
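If the 15-minute update cadence from the engineering message proves hard to keep by hand, a reminder bot could check it with logic along these lines. This is a minimal sketch, assuming you can obtain the timestamp of the last update yourself; `update_overdue` is a hypothetical helper and does not call any Slack API.

```python
from datetime import datetime, timedelta

# "At least every 15 minutes", from the Live Updates step above.
UPDATE_INTERVAL = timedelta(minutes=15)


def update_overdue(last_update: datetime, now: datetime) -> bool:
    """Return True when more than 15 minutes have passed since the last update."""
    return now - last_update > UPDATE_INTERVAL
```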
Step 5: Periodic Spreadsheet Check-ins
To keep your postmortem process accountable, schedule a 15-minute meeting with yourself, as the owner of the incident management process, once per month to review the spreadsheet and confirm that the assigned action items are being completed. Expect that you’ll have to nudge people.
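If the tracker grows, the monthly review can be partly scripted against a CSV export of the spreadsheet. The sketch below assumes hypothetical column names (`action_item`, `owner`, `due_date` in YYYY-MM-DD format, and `status`); adjust them to match whatever your copy of the Incident Tracker actually uses.

```python
import csv
import io
from datetime import date


def overdue_action_items(csv_text: str, today: date) -> list:
    """Return rows whose due date has passed and whose status is not "done".

    csv_text is the exported spreadsheet as text; the column names used
    here are assumptions, not the tracker's real headers.
    """
    overdue = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        due = date.fromisoformat(row["due_date"])
        is_done = row["status"].strip().lower() == "done"
        if due < today and not is_done:
            overdue.append(row)
    return overdue
```

Each returned row still carries its `owner` field, so the output doubles as your list of people to nudge.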
That’s it!
By following these steps and using the provided templates and messages, you can quickly establish a simple yet effective incident management program and ensure your team is ready to handle incidents efficiently.
As you go deeper, PagerDuty has a fantastic resource for leveling up various aspects of your incident management program: https://response.pagerduty.com/