1756-Added downtime incident management document #1781
Changes from 3 commits
d5e3c66
a063cf0
4a4c03e
7d91689
b65af59
3d3e142
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
# Downtime Incident Management Runbook | ||
|
||
Our team has agreed upon steps for handling incidents that cause our site to go offline or become unusable for users. For this document, an incident is one in which manage.get.gov is offline or displaying 400/500 HTTP errors on all pages and is caused by a critical bug in our code, not to be confused with a security incident. This document should not be used for security incident response. | ||
|
||
## Response management rules | ||
|
||
The following set of rules should be followed while an incident is in progress. | ||
|
||
- The person who first notices that the site is down is responsible for using @here and notifying in #dotgov-announce that production is down | ||
PaulKuykendall marked this conversation as resolved.
fullstops after every bullet point.
what? Chaotically switching between having periods and not having periods doesn't make sense? Will change this
||
- This applies to any team member, including new team members and non-developers | ||
- If no engineer has acknowledged the announcement within 10 minutes, whoever discovered the site was down should call each developer via the Slack DM huddle feature. If there is no response, this should escalate to a phone call. | ||
- When calling, go down the [phone call list](https://docs.google.com/document/d/1k4r-1MNCfW8EXSXa-tqJQzOvJxQv0ARvHnOjjAH0LII/edit) from top to bottom until someone answers who is available to help. | ||
- If this incident occurs outside of regular working hours, choosing to help is on a volunteer basis, and answering a call doesn't mean an individual is truly available to assist. | ||
- Once an engineer is online, they should immediately start a huddle in the #dotgov-redalert channel to begin troubleshooting | ||
- All available engineers should join the huddle once they see it. | ||
- If downtime occurs outside of working hours, team members who are off for the day may still be pinged and called but are not required to join if unavailable to do so. | ||
- Uncomment the banner on get.gov, so it is transparent to users that we know about the issue on manage.get.gov | ||
h-m-f-t marked this conversation as resolved.
|
||
- Designers or Developers should be able to make this change; if designers are online and can help with this task, that will allow developers to focus on fixing the bug. | ||
- If the problem is not solved within three hours, change the rules on Cloudflare's admin site so that navigating to manage.get.gov redirects users to get.gov. This will help them see the banner on get.gov informing them that this is a known problem | ||
h-m-f-t marked this conversation as resolved.
|
||
|
||
## Post Incident | ||
|
||
The following checklist should be followed after the site is back up and running. | ||
|
||
- [ ] Turn off the redirect (if used) and verify functionality | ||
h-m-f-t marked this conversation as resolved.
|
||
- [ ] Remove the banner on get.gov by commenting it out | ||
- [ ] Write up what happened and when. If the cause is already known, write that as well. | ||
We can put some effort into an "incident" template to be used during the outage, to capture the details as they happen.
Perhaps we pin the template to the redalert channel, so the reporter can grab it and start logging information.
There is a template already in progress
||
- [ ] If the cause is not known yet, developers should investigate the issue as the highest priority task. | ||
- [ ] As close to the event as possible, such as the next day, perform a team incident retro that is an hour long. The goal of this meeting should be to inform all team members what happened and what is being done now and to collect feedback on what could have been done better |
regarding the "critical bug in our code," we won't know the cause at the outset, only that the site is down. Also, it needs to be clear how this is different from a security incident.
Actually the point here is when there is only something critical in our code, but I could add in or third parties. The point being I don't want to make this doc about defining what a security attack could be, etc. This is about things in our control directly.
language updated see what you think now
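
For reference while reviewing, the two time-based rules in the runbook (escalate after 10 minutes without acknowledgement, redirect after three hours unresolved) can be sketched as a small lookup. This is only an illustration and not part of the PR: the function name `due_actions`, the action strings, and the choice to measure both timers from the first announcement are assumptions, since the runbook does not say when the three-hour clock starts.

```python
from datetime import timedelta

# Thresholds taken from the runbook text.
ACK_TIMEOUT = timedelta(minutes=10)
REDIRECT_AFTER = timedelta(hours=3)


def due_actions(elapsed, acknowledged, resolved):
    """Return the runbook steps that are due `elapsed` after the announcement.

    Hypothetical helper: both timers are assumed to start at the
    #dotgov-announce message, which the runbook does not specify.
    """
    if resolved:
        return ["start post-incident checklist"]
    actions = []
    if not acknowledged and elapsed >= ACK_TIMEOUT:
        actions.append("call developers via Slack huddle, then phone call list")
    if elapsed >= REDIRECT_AFTER:
        actions.append("redirect manage.get.gov to get.gov in Cloudflare")
    return actions
```

For example, `due_actions(timedelta(minutes=12), acknowledged=False, resolved=False)` returns only the calling step, while an acknowledged incident that is still open after three hours returns only the redirect step.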