title | description |
---|---|
System Criticality | A framework for treating systems differently according to how critical they are |
We'd like to identify gaps in the health and resilience of our platform as well as prioritize efforts to address them, but our platform comprises dozens of systems with very different needs and uses. This framework aims to recognize those differences with a shared vocabulary, so we can set expectations for different systems accordingly.
Systems that are essential to basic business operations such as registration, authentication, browsing, inquiring, bidding, and buying. These systems understandably experience relatively high throughput, and any disruption can have a sizable financial and brand impact.
Systems with limited throughput or public-facing functions. Disruptions may interfere with certain business operations or have mild financial or brand impact.
Internal utilities or systems with only occasional usage. Experimentation should be cheap and easy, so we embrace that some tools serve only a few individuals or use cases.
Retired systems, spikes, or time-bound experiments that aren't significantly used. There shouldn't ever be many of these systems, nor should product managers have to consider them. They aren't expected to satisfy the objectives below.
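As a rough illustration only, the four tiers described above could be recorded in a small catalog so that tooling can tell which systems the expectations apply to. The tier names (`ESSENTIAL`, `LIMITED`, `OCCASIONAL`, `UNSUPPORTED`), the `System` record, and both helper functions are hypothetical; they are not part of the framework, and this sketch deliberately does not map the tiers to level numbers.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    # Hypothetical names, listed in the order the tiers are described above.
    ESSENTIAL = "essential"      # essential to basic business operations
    LIMITED = "limited"          # limited throughput or public-facing functions
    OCCASIONAL = "occasional"    # internal utilities, occasional usage
    UNSUPPORTED = "unsupported"  # retired systems, spikes, time-bound experiments


@dataclass(frozen=True)
class System:
    name: str
    tier: Tier


def expectations_apply(system: System) -> bool:
    """Return True unless the system is in the retired/experimental tier."""
    return system.tier is not Tier.UNSUPPORTED


def audit_queue(systems):
    """Systems the expectations apply to, in Tier definition order."""
    order = list(Tier)
    in_scope = [s for s in systems if expectations_apply(s)]
    return sorted(in_scope, key=lambda s: order.index(s.tier))
```

A catalog like this would let a gap analysis start with the most critical systems and skip retired ones, which matches the intent of prioritizing resilience work by criticality.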
Each level has corresponding expectations for how the system is built, operated, and maintained. These are target expectations, and existing systems may not fully comply yet. New systems should aim to abide by them as much as possible.
| | Level 1 | Level 2 | Level 3 |
|---|---|---|---|
| Development | | | |
| Testing | | | |
| Deployment | | | |
| Performance | | | |
| Errors | | | |
| Monitoring | | | |
| Incident handling | | | |
| Data | | | |
| Security | | | |
See the full list of projects🔒 for how individual systems align with these levels.