1756-Added downtime incident management document #1781

Merged
merged 6 commits into main from 1756-incident-response-playbook on Mar 5, 2024

Conversation

@abroddrick abroddrick commented Feb 14, 2024

Ticket

Resolves #1756

Changes

  • adds document for managing when the site is down
  • links to a document for phone numbers

Context for reviewers

Setup

Code Review Verification Steps

As the original developer, I have

Satisfied acceptance criteria and met development standards

  • Met the acceptance criteria, or will meet them in a subsequent PR
  • Created/modified automated tests
  • Added at least 2 developers as PR reviewers (only 1 will need to approve)
  • Messaged on Slack or in standup to notify the team that a PR is ready for review
  • Changes to “how we do things” are documented in READMEs and/or the onboarding guide
  • If any model was updated to modify/add/delete columns, makemigrations was run and the associated migrations file has been committed.

As a code reviewer, I have

Reviewed, tested, and left feedback about the changes

  • Pulled this branch locally and tested it
  • Reviewed this code and left comments
  • Checked that all code is adequately covered by tests
  • Made it clear which comments need to be addressed before this work is merged
  • If any model was updated to modify/add/delete columns, makemigrations was run and the associated migrations file has been committed.

Ensured code standards are met (Code reviewer)

  • All new functions and methods are commented using plain language
  • Interactions with external systems are wrapped in try/except
  • Error handling exists for unusual or missing values
  • (Rarely needed) Did dependency updates in Pipfile also get changed in requirements.txt?

@PaulKuykendall PaulKuykendall left a comment

Just a couple of comments for you on points that I think could be clarified.


- [ ] Turn off the redirect (if used) and verify functionality
- [ ] Remove the banner on get.gov by commenting it out
- [ ] Write up what happened and when. If the cause is already known, write that as well.

We can put some effort into an "incident" template to be used during the outage, to capture the details as they happen.

Perhaps we pin the template to the redalert channel, so the reporter can grab it and start logging information.

Contributor Author

There is a template already in progress

@@ -0,0 +1,29 @@
# Downtime Incident Management Runbook

Our team has agreed upon steps for handling incidents that cause our site to go offline or become unusable for users. For this document, an incident refers to one in which manage.get.gov is offline or displaying 400/500 HTTP errors on all pages and is caused by a critical bug in our code, not to be confused with a security incident. This document should not be used for security incident response.

Regarding the "critical bug in our code": we won't know the cause at the outset, only that the site is down. Also, it needs to be clear how this is different from a security incident.

Contributor Author

Actually, the point here is that this covers only cases where something critical is wrong in our code, but I could add "or third parties." I don't want to make this doc about defining what a security attack could be, etc. This is about things directly in our control.

Contributor Author

Language updated; see what you think now.

@abroddrick
Contributor Author

@PaulKuykendall, I made updates; please re-review when you have a spare moment.

@h-m-f-t
Member

h-m-f-t commented Feb 19, 2024

We should also have a post-incident to-do to review these steps as a team and update them as useful.


The following set of rules should be followed while an incident is in progress.

- The person who first notices that the site is down is responsible for using @here and notifying in #dotgov-announce that production is down

Contributor

Full stops after every bullet point.

Contributor Author

What? Chaotically switching between having periods and not having periods doesn't make sense?

Will change this.

Member

@h-m-f-t h-m-f-t left a comment

One small update and a request for a test.

docs/operations/runbooks/downtime_incident_management.md (outdated, resolved)
@abroddrick abroddrick dismissed h-m-f-t’s stale review February 29, 2024 17:05

Changes made, requesting re-review

Member

@h-m-f-t h-m-f-t left a comment

🎉

@h-m-f-t
Member

h-m-f-t commented Mar 1, 2024

@abroddrick – once merged, let's link to this document in redalert's canvas.

@abroddrick abroddrick merged commit d636de7 into main Mar 5, 2024
3 checks passed
@abroddrick abroddrick deleted the 1756-incident-response-playbook branch March 5, 2024 15:34

Successfully merging this pull request may close these issues.

Create an incident response playbook
4 participants