---
title: System Criticality
description: A framework for treating systems differently according to how critical they are
---

We'd like to identify gaps in the health and resilience of our platform as well as prioritize efforts to address them, but our platform comprises dozens of systems with very different needs and uses. This framework aims to recognize those differences with a shared vocabulary, so we can set expectations for different systems accordingly.

## System Criticality Framework

### Level 3 (critical)

Systems that are essential to basic business operations such as registration, authentication, browsing, inquiring, bidding, and buying. These systems handle relatively high throughput, and any disruption can have sizable financial and brand impact.

### Level 2 (important)

Systems with limited throughput or public-facing functions. Disruptions may interfere with certain business operations or have mild financial or brand impact.

### Level 1 (supported)

Internal utilities or systems with only occasional usage. Experimentation should be cheap and easy, so we embrace that some tools serve only a few individuals or use cases.

### Level 0 (unsupported)

Retired systems, spikes, or time-bound experiments that aren't significantly used. There shouldn't ever be many of these systems, nor should product managers have to consider them. They aren't expected to satisfy the objectives below.
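Because the levels form an ordered scale, tooling can treat them as comparable values. A minimal sketch of how they might be encoded (the enum and helper below are illustrative, not part of the framework itself):

```python
from enum import IntEnum


class Criticality(IntEnum):
    """The four criticality levels described above (names are illustrative)."""

    UNSUPPORTED = 0  # retired systems, spikes, time-bound experiments
    SUPPORTED = 1    # internal utilities, occasional usage
    IMPORTANT = 2    # limited throughput or public-facing functions
    CRITICAL = 3     # essential to basic business operations


def requires_incident_reporting(level: Criticality) -> bool:
    # Hypothetical policy check: comparisons express "at least this critical".
    return level >= Criticality.IMPORTANT
```

Using `IntEnum` keeps the levels sortable and comparable, so policy checks like "Level 2 and above" stay one-liners.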

## System expectations (draft)

Each level carries corresponding expectations for how the system is built, operated, and maintained. These are target expectations, and existing systems may not yet fully comply. New systems should abide by them as much as possible.

| | Level 1 | Level 2 | Level 3 |
| --- | --- | --- | --- |
| **Development** | • READMEs contain up-to-date set-up instructions<br>• Code review on all pull requests<br>• New systems or significant new components undergo tech review | ←ditto | ←ditto, and:<br>• Production environment is replicable locally (TODO) |
| **Testing** | • Automated tests<br>• Tests are run on PRs and master by a continuous-integration pipeline | ←ditto | ←ditto, and:<br>• Test-coverage tooling |
| **Deployment** | • Zero-downtime deployment | ←ditto, and:<br>• High-fidelity staging environment | ←ditto, and:<br>• Deployment and orchestration by Kubernetes |
| **Performance** | — | — | • Latency-based monitors (e.g., p90) and alerting, tailored to the service |
| **Errors** | • Error tracking (e.g., Sentry) | ←ditto, and:<br>• Error-rate alerting (e.g., Datadog monitors) | ←ditto |
| **Monitoring** | • External availability monitoring | ←ditto, and:<br>• Application instrumentation (e.g., Datadog) | ←ditto |
| **Incident handling** | — | • Downtime automatically reported to Opsgenie as incidents | ←ditto, and:<br>• Incidents are reported and updated on a public status page |
| **Data** | • Automated backups<br>• PII usage is avoided and, when necessary, documented and integrated with account-deletion procedures | ←ditto | ←ditto, and:<br>• Production data (or a subset) synced to the staging environment |
| **Security** | • GitHub vulnerability tracking enabled<br>• Database encryption at rest advised for new systems<br>• X response time for bounties/vulnerabilities (TODO) | ←ditto, and:<br>• Database encryption at rest required | ←ditto |
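The "←ditto" cells make the expectations cumulative: each level inherits everything required of the levels below it. A minimal sketch of that inheritance (the expectation lists here are abridged examples, not the authoritative set):

```python
# Hypothetical mapping from criticality level to the expectations *added*
# at that level; lower-level expectations carry forward automatically.
EXPECTATIONS: dict[int, list[str]] = {
    1: ["Automated tests", "Zero-downtime deployment", "Error tracking"],
    2: ["High-fidelity staging environment", "Error-rate alerting"],
    3: ["Deployment by Kubernetes", "Latency-based monitors",
        "Public status page updates"],
}


def expectations_for(level: int) -> list[str]:
    """Return every expectation that applies at the given level,
    accumulating from Level 1 upward."""
    return [
        item
        for lvl in sorted(EXPECTATIONS)
        if lvl <= level
        for item in EXPECTATIONS[lvl]
    ]
```

So `expectations_for(2)` yields the Level 1 items plus the Level 2 additions, mirroring how a reader resolves the "←ditto, and…" cells by walking left across the table.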

See the full list of projects🔒 for how individual systems align with these levels.