Add missing incident response process doc

Also update go links pointing to it and remove slack channel links.
tailscale · Aug 30, 2023 · 7639728
1 parent e519c5a
commit 7639728
Showing 3 changed files with 70 additions and 3 deletions.
diff --git a/bcp-dr/index.md b/bcp-dr/index.md
@@ -34,4 +34,4 @@ An incident could be detected internally by monitoring tools, by an employee in
 
 ### Outage response and remediation
 
-If a suspected outage or other business continuity incident is detected, it should be responded to following the [Incident response process](http://go/incident-response-process).
+If a suspected outage or other business continuity incident is detected, it should be responded to following the [Incident response process](/security-policies/incident-response-process).
diff --git a/incident-response/index.md b/incident-response-policy/index.md
@@ -16,7 +16,7 @@ The following minimum standards apply to Tailscale’s assets as managed by empl
 
 An incident could be detected internally by an employee in their course of work, by an employee or vendor doing a review of Tailscale’s security posture, or an external third party reporting a potential vulnerability to us.
 
-If you see something, say something. All Tailscale employees should immediately report suspected security incidents or suspicious activity that occurs at Tailscale, including but not limited to security incidents, physical injury, theft, property damage, denial of service attacks, threats, harassment, abuse of individual user accounts, forgery and misrepresentation. Suspicious activity can be reported to the Slack channel [#incident-response](https://tailscale.slack.com/archives/C02SJSHV41H), or, for potentially sensitive incidents, to the Security Review Team or to the Chief Operating Officer (COO). Violations of the [Code of Conduct](http://go/code-of-conduct) should be reported to the Chief Operating Officer (COO).
+If you see something, say something. All Tailscale employees should immediately report suspected security incidents or suspicious activity that occurs at Tailscale, including but not limited to security incidents, physical injury, theft, property damage, denial of service attacks, threats, harassment, abuse of individual user accounts, forgery and misrepresentation. Suspicious activity can be reported to the Slack channel #incident-response, or, for potentially sensitive incidents, to the Security Review Team or to the Chief Operating Officer (COO). Violations of the [Code of Conduct](http://go/code-of-conduct) should be reported to the Chief Operating Officer (COO).
 
 All employees should watch for potentially suspicious activities, including:
 
@@ -40,7 +40,7 @@ Tailscale’s Security Review Team reviews and responds to potential third-party
 
 ### Incident response and remediation
 
-If a suspected incident is detected, it should be responded to following the [Incident response process](http://go/incident-response-process).
+If a suspected incident is detected, it should be responded to following the [Incident response process](/security-policies/incident-response-process/).
 
 We respond to reported incidents, and resolve and determine impact as soon as possible. We aim to remediate incidents as soon as possible.
 

diff --git a/incident-response-process/index.md b/incident-response-process/index.md
@@ -0,0 +1,67 @@
+---
+title: Incident response process
+slug: incident-response-process
+policy: true
+faq: false
+weight: TODO
+---
+
+### Incident response
+
+When a suspected incident is reported, it is first investigated by the SIGENG
+oncall. If it is suspected to be an incident, they should declare an incident,
+and identify the Incident Commander in the #incident-response Slack channel.
+The Incident Commander is responsible for:
+
+* If an incident is likely to require ongoing response and remediation efforts,
+  opening a GitHub issue in the tailscale/incidents repo to track updates to
+  the incident and creating a Google doc for collaborative work.
+* Classifying the severity of the incident, including scope and the risk of any
+  assets which may be affected. This can be further updated as information
+  changes, and may inform how we choose to react. Depending on the urgency of
+  the incident, this may be done after the fact.
+* Contacting vendors or coordinating to contact vendors, to validate if their
+  product may be compromised.
+* Appointing roles, including a communications lead, if needed.
+* Ensuring handoff between team members, for example, at the end of a work day.
+* Escalating to leadership if responses are insufficient.
+
+In addition to remediating the incident, Tailscale employees should also seek
+to put into place any corrective actions possible to lessen the impact of an
+incident.
+
+If an incident affects customers, including their data or their ability to use
+Tailscale, Tailscale may choose to proactively communicate the issue publicly.
+
+### Incident recovery
+
+If data or processes were disrupted by the incident, then the [BCP/DR policy](/security-policies/bcp-dr/)
+should be followed to remediate the issue.
+
+Once an incident is mitigated or otherwise closed, it is the Incident
+Commander’s responsibility to ensure that
+
+* The resolution is communicated to all affected parties, including external
+  customers, if applicable.
+* For incidents causing a production outage or loss of customer or other
+  critical data, a post-mortem is completed. This should include: details of
+  the incident, timeline of the incident, its impact, the actions taken to
+  mitigate or resolve it, the root cause(s), and the follow-up actions to
+  prevent the incident from recurring. Where applicable, some version of the
+  post-mortem may be shared with external affected parties. Newly identified
+  risks should be added to the risk register.
+
+### Incident classification
+
+An incident is an adverse event which affects Tailscale’s infrastructure or
+business operations in such a way that it compromises our ability to deliver
+the service customers expect. A vulnerability is not necessarily an incident;
+for example, a vulnerability not being actively exploited may require action,
+but not expedited action beyond existing vulnerability remediation processes.
+
+Incidents can be classified based on their severity:
+
+| Critical | Extreme or complete production outage, significantly degraded experience for >50% of Tailscale users, or customer or other critical data loss or corruption. |
+| High | Partial outage of some production functionality or in some regions, degraded experience for multiple customers with no workaround available, or suspected severe security breach. |
+| Medium | Non-critical functionality loss or degradation for some customers, with possible short-term workaround, or detection of unauthorized activity. |
+| Low | No current or known customer impact. |