1756-Added downtime incident management document #1781
Just a couple of comments on points that I think could be clarified.
- [ ] Turn off the redirect (if used) and verify functionality
- [ ] Remove the banner on get.gov by commenting it out
- [ ] Write up what happened and when. If the cause is already known, write that as well.
We can put some effort into an "incident" template to be used during the outage, to capture the details as they happen.
Perhaps we pin the template to the redalert channel, so the reporter can grab it and start logging information.
There is a template already in progress.
@@ -0,0 +1,29 @@
# Downtime Incident Management Runbook

Our team has agreed upon steps for handling incidents that cause our site to go offline or become unusable for users. For this document, an incident is one in which manage.get.gov is offline or displaying HTTP 400/500 errors on all pages, and which is caused by a critical bug in our code. It is not to be confused with a security incident. This document should not be used for security incident response.
Regarding the "critical bug in our code": we won't know the cause at the outset, only that the site is down. Also, it needs to be clear how this differs from a security incident.
Actually, the point here is that this covers only cases where something critical is wrong in our code, though I could add "or third parties." I don't want to make this doc about defining what a security attack could be, etc. This is about things directly in our control.
Language updated; see what you think now.
@PaulKuykendall, I made updates. Please re-review when you have a spare moment.
We should also have as a post- todo to review these steps as a team and update as useful.

The following set of rules should be followed while an incident is in progress.

- The person who first notices that the site is down is responsible for using @here and notifying in #dotgov-announce that production is down
Full stops after every bullet point.
What? Chaotically switching between having periods and not having periods doesn't make sense?
Will change this.
One small update and a request for a test.
Changes made, requesting re-review
🎉
@abroddrick – once merged, let's link to this document in redalert's canvas.
Ticket
Resolves #1756
Changes
Context for reviewers
Setup
Code Review Verification Steps
As the original developer, I have
Satisfied acceptance criteria and met development standards
As a code reviewer, I have
Reviewed, tested, and left feedback about the changes
Ensured code standards are met (Code reviewer)