Add missing incident response process doc
Also update go links pointing to it and remove slack channel links.
awly committed Aug 30, 2023
1 parent e519c5a commit 7639728
Showing 3 changed files with 70 additions and 3 deletions.
2 changes: 1 addition & 1 deletion bcp-dr/index.md
@@ -34,4 +34,4 @@ An incident could be detected internally by monitoring tools, by an employee in

### Outage response and remediation

-If a suspected outage or other business continuity incident is detected, it should be responded to following the [Incident response process](http://go/incident-response-process).
+If a suspected outage or other business continuity incident is detected, it should be responded to following the [Incident response process](/security-policies/incident-response-process).
@@ -16,7 +16,7 @@ The following minimum standards apply to Tailscale’s assets as managed by empl

An incident could be detected internally by an employee in their course of work, by an employee or vendor doing a review of Tailscale’s security posture, or an external third party reporting a potential vulnerability to us.

-If you see something, say something. All Tailscale employees should immediately report suspected security incidents or suspicious activity that occurs at Tailscale, including but not limited to security incidents, physical injury, theft, property damage, denial of service attacks, threats, harassment, abuse of individual user accounts, forgery and misrepresentation. Suspicious activity can be reported to the Slack channel [#incident-response](https://tailscale.slack.com/archives/C02SJSHV41H), or, for potentially sensitive incidents, to the Security Review Team or to the Chief Operating Officer (COO). Violations of the [Code of Conduct](http://go/code-of-conduct) should be reported to the Chief Operating Officer (COO).
+If you see something, say something. All Tailscale employees should immediately report suspected security incidents or suspicious activity that occurs at Tailscale, including but not limited to security incidents, physical injury, theft, property damage, denial of service attacks, threats, harassment, abuse of individual user accounts, forgery and misrepresentation. Suspicious activity can be reported to the Slack channel #incident-response, or, for potentially sensitive incidents, to the Security Review Team or to the Chief Operating Officer (COO). Violations of the [Code of Conduct](http://go/code-of-conduct) should be reported to the Chief Operating Officer (COO).

All employees should watch for potentially suspicious activities, including:

@@ -40,7 +40,7 @@ Tailscale’s Security Review Team reviews and responds to potential third-party

### Incident response and remediation

-If a suspected incident is detected, it should be responded to following the [Incident response process](http://go/incident-response-process).
+If a suspected incident is detected, it should be responded to following the [Incident response process](/security-policies/incident-response-process/).

We respond to reported incidents, and resolve and determine impact as soon as possible. We aim to remediate incidents as soon as possible.

67 changes: 67 additions & 0 deletions incident-response-process/index.md
@@ -0,0 +1,67 @@
---
title: Incident response process
slug: incident-response-process
policy: true
faq: false
weight: TODO
---

### Incident response

When a suspected incident is reported, it is first investigated by the SIGENG
oncall. If the oncall still suspects an incident, they should declare one
and identify the Incident Commander in the #incident-response Slack channel.
The Incident Commander is responsible for:

* If an incident is likely to require ongoing response and remediation efforts,
opening a GitHub issue in the tailscale/incidents repo to track updates to
the incident and creating a Google doc for collaborative work.
* Classifying the severity of the incident, including scope and the risk of any
assets which may be affected. This can be further updated as information
changes, and may inform how we choose to react. Depending on the urgency of
the incident, this may be done after the fact.
* Contacting vendors or coordinating to contact vendors, to validate if their
product may be compromised.
* Appointing roles, including a communications lead, if needed.
* Ensuring handoff between team members, for example, at the end of a work day.
* Escalating to leadership if responses are insufficient.

In addition to remediating the incident, Tailscale employees should also seek
to put into place any corrective actions possible to lessen the impact of an
incident.

If an incident affects customers, including their data or their ability to use
Tailscale, Tailscale may choose to proactively communicate the issue publicly.

### Incident recovery

If data or processes were disrupted by the incident, then the [BCP/DR policy](/security-policies/bcp-dr/)
should be followed to remediate the issue.

Once an incident is mitigated or otherwise closed, it is the Incident
Commander’s responsibility to ensure that:

* The resolution is communicated to all affected parties, including external
customers, if applicable.
* For incidents causing a production outage or loss of customer or other
critical data, a post-mortem is completed. This should include: details of
the incident, timeline of the incident, its impact, the actions taken to
mitigate or resolve it, the root cause(s), and the follow-up actions to
prevent the incident from recurring. Where applicable, some version of the
post-mortem may be shared with external affected parties. Newly identified
risks should be added to the risk register.

### Incident classification

An incident is an adverse event which affects Tailscale’s infrastructure or
business operations in such a way that it compromises our ability to deliver
the service customers expect. A vulnerability is not necessarily an incident;
for example, a vulnerability not being actively exploited may require action,
but not expedited action beyond existing vulnerability remediation processes.

Incidents can be classified based on their severity:

| Critical | Extreme or complete production outage, significantly degraded experience for >50% of Tailscale users, or customer or other critical data loss or corruption. |
| High | Partial outage of some production functionality or in some regions, degraded experience for multiple customers with no workaround available, or suspected severe security breach. |
| Medium | Non-critical functionality loss or degradation for some customers, with possible short-term workaround, or detection of unauthorized activity. |
| Low | No current or known customer impact. |
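
The severity table above reads as a top-down decision rule: check the Critical conditions first, then High, then Medium, and fall back to Low. A minimal sketch of that ordering follows, assuming each condition can be reduced to a boolean or a fraction. The `IncidentFacts` fields and the `classify` function are hypothetical names chosen for illustration and are not part of the policy; real classification remains a judgment call by the Incident Commander.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    CRITICAL = "Critical"
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


@dataclass
class IncidentFacts:
    """Hypothetical inputs; field names are illustrative, not from the policy."""
    complete_outage: bool = False            # extreme or complete production outage
    degraded_user_fraction: float = 0.0      # share of users significantly degraded
    critical_data_loss: bool = False         # customer or other critical data loss/corruption
    partial_outage: bool = False             # some functionality or some regions
    customers_degraded_no_workaround: bool = False  # multiple customers, no workaround
    suspected_severe_breach: bool = False
    noncritical_degradation: bool = False    # some customers, short-term workaround possible
    unauthorized_activity: bool = False


def classify(facts: IncidentFacts) -> Severity:
    """Return the highest severity whose conditions match, checked top-down."""
    if (facts.complete_outage
            or facts.degraded_user_fraction > 0.5
            or facts.critical_data_loss):
        return Severity.CRITICAL
    if (facts.partial_outage
            or facts.customers_degraded_no_workaround
            or facts.suspected_severe_breach):
        return Severity.HIGH
    if facts.noncritical_degradation or facts.unauthorized_activity:
        return Severity.MEDIUM
    # No current or known customer impact.
    return Severity.LOW
```

Because the checks run from most to least severe, an incident matching several rows is assigned the highest matching level, and the result can be recomputed as facts change during the response.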
