---
title: System Criticality
description: A framework for treating systems differently according to how critical they are
---

We'd like to identify gaps in the health and resilience of our platform as well as prioritize efforts to address them, but our platform comprises dozens of systems with very different needs and uses. This framework aims to recognize those differences with a shared vocabulary, so we can set expectations for different systems accordingly.

## System Criticality Framework

### Level 3 (critical)

Systems that are essential to basic business operations such as registration, authentication, browsing, inquiring, bidding, and buying. These systems handle relatively high throughput, and any disruption can have sizable financial and brand impact.

### Level 2 (important)

Systems with limited throughput or public-facing functions. Disruptions may interfere with certain business operations or have mild financial or brand impact.

### Level 1 (supported)

Internal utilities or systems with only occasional usage. Experimentation should be cheap and easy, so we embrace that some tools serve only a few individuals or use cases.

### Level 0 (unsupported)

Retired systems, spikes, or time-bound experiments that aren't significantly used. There shouldn't ever be many of these systems, nor should product managers have to consider them. They aren't expected to satisfy the objectives below.
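Because the levels form an ordered scale, tooling can treat them as comparable values. A minimal sketch of how they might be encoded (the enum and helper below are illustrative, not part of the framework itself):

```python
from enum import IntEnum


class Criticality(IntEnum):
    """The four criticality levels described above (names are illustrative)."""

    UNSUPPORTED = 0  # retired systems, spikes, time-bound experiments
    SUPPORTED = 1    # internal utilities, occasional usage
    IMPORTANT = 2    # limited throughput or public-facing functions
    CRITICAL = 3     # essential to basic business operations


def requires_incident_reporting(level: Criticality) -> bool:
    # Hypothetical policy check: comparisons express "at least this critical".
    return level >= Criticality.IMPORTANT
```

Using `IntEnum` keeps the levels sortable and comparable, so policy checks like "Level 2 and above" stay one-liners.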

## System expectations (draft)

Each level carries corresponding expectations for how the system is built, operated, and maintained. These are target expectations, and existing systems may not yet fully comply. New systems should abide by them as much as possible.

| | Level 1 | Level 2 | Level 3 |
| --- | --- | --- | --- |
| **Development** | • READMEs contain up-to-date set-up instructions<br>• Code review on all pull requests<br>• New systems or significant new components undergo tech review | ←ditto | ←ditto, and:<br>• Production environment is replicable locally (TODO) |
| **Testing** | • Automated tests<br>• Tests are run on PRs and master by a continuous-integration pipeline | ←ditto | ←ditto, and:<br>• Test-coverage tooling |
| **Deployment** | • Zero-downtime deployment | ←ditto, and:<br>• High-fidelity staging environment | ←ditto, and:<br>• Deployment and orchestration by Kubernetes |
| **Performance** | — | — | • Latency-based monitors (e.g., p90) and alerting, tailored to the service |
| **Errors** | • Error tracking (e.g., Sentry) | ←ditto, and:<br>• Error-rate alerting (e.g., Datadog monitors) | ←ditto |
| **Monitoring** | • External availability monitoring | ←ditto, and:<br>• Application instrumentation (e.g., Datadog) | ←ditto |
| **Incident handling** | — | • Downtime automatically reported to Opsgenie as incidents | ←ditto, and:<br>• Incidents are reported and updated on a public status page |
| **Data** | • Automated backups<br>• PII usage is avoided and, when necessary, documented and integrated with account-deletion procedures | ←ditto | ←ditto, and:<br>• Production data (or a subset) synced to the staging environment |
| **Security** | • GitHub vulnerability tracking enabled<br>• Database encryption at rest advised for new systems<br>• X response time for bounties/vulnerabilities (TODO) | ←ditto, and:<br>• Database encryption at rest required | ←ditto |
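The "←ditto" cells make the expectations cumulative: each level inherits everything required of the levels below it. A minimal sketch of that inheritance (the expectation lists here are abridged examples, not the authoritative set):

```python
# Hypothetical mapping from criticality level to the expectations *added*
# at that level; lower-level expectations carry forward automatically.
EXPECTATIONS: dict[int, list[str]] = {
    1: ["Automated tests", "Zero-downtime deployment", "Error tracking"],
    2: ["High-fidelity staging environment", "Error-rate alerting"],
    3: ["Deployment by Kubernetes", "Latency-based monitors",
        "Public status page updates"],
}


def expectations_for(level: int) -> list[str]:
    """Return every expectation that applies at the given level,
    accumulating from Level 1 upward."""
    return [
        item
        for lvl in sorted(EXPECTATIONS)
        if lvl <= level
        for item in EXPECTATIONS[lvl]
    ]
```

So `expectations_for(2)` yields the Level 1 items plus the Level 2 additions, mirroring how a reader resolves the "←ditto, and…" cells by walking left across the table.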

See the full list of projects🔒 for how individual systems align with these levels.