Taxonomy of failsafe levels #579

Open: wants to merge 9 commits into base: main

josephineSei
Contributor

closes #527

Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com>
Contributor

@markus-hentsch markus-hentsch left a comment

Good write-up. I added some spelling, phrasing and terminology adjustment suggestions.

Standards/scs-XXXX-vN-taxonomy-of-failsafe-levels.md (10 review comments, outdated and resolved)
josephineSei and others added 2 commits April 29, 2024 09:51
Co-authored-by: Markus Hentsch <129268441+markus-hentsch@users.noreply.github.com>
Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com>

| Term | Explanation |
| ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- |
| Virtual Machine | Equals the `server` resource in Nova. |
Member

We should be careful here. If you have integrated Ironic with Nova, a server in Nova could also be a physical node.

Contributor Author

That is a good point. We definitely need to distinguish between VMs and Ironic nodes.
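
A minimal sketch of how this distinction could be made explicit in tooling, under stated assumptions: the function below classifies a Nova `server` as a VM or a bare-metal node from the type of the hypervisor behind it. The dictionary layout and the `hypervisor_type` key are hypothetical stand-ins for whatever an operator's inventory exposes, and the value `ironic` is assumed to mark servers provisioned through Ironic.

```python
from enum import Enum


class ServerKind(Enum):
    VIRTUAL_MACHINE = "virtual machine"
    BARE_METAL_NODE = "bare-metal node"


def classify_server(server: dict) -> ServerKind:
    """Classify a Nova server record as a VM or an Ironic bare-metal node.

    Assumption: `server` is a plain dict with a hypothetical
    "hypervisor_type" key; the value "ironic" marks Ironic-backed servers.
    """
    hypervisor_type = str(server.get("hypervisor_type", "")).lower()
    if hypervisor_type == "ironic":
        return ServerKind.BARE_METAL_NODE
    # Everything else is treated as a virtual machine in this sketch.
    return ServerKind.VIRTUAL_MACHINE


# Hypothetical inventory records, for illustration only.
servers = [
    {"name": "web-1", "hypervisor_type": "QEMU"},
    {"name": "db-metal-1", "hypervisor_type": "ironic"},
]
kinds = {s["name"]: classify_server(s) for s in servers}
```

With such a split, the glossary entry for "Virtual Machine" could refer only to servers of the first kind, and failure cases for bare-metal nodes could be discussed separately.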

Contributor

@anjastrunk anjastrunk left a comment

IMO, this DR is not in a final state, and we should go for another round of discussion.

Comment on lines 38 to 39
Some standards provided by the SCS will talk about or require procedures to backup resources or have redundancy for resources.
This decision record should discuss, which failure threats are CSP-facing and will classify them into several levels.
Contributor

Some glue text is missing here. I would rephrase the sentence as follows:

Suggested change
Some standards provided by the SCS will talk about or require procedures to backup resources or have redundancy for resources.
This decision record should discuss, which failure threats are CSP-facing and will classify them into several levels.
Some standards provided by the SCS project will talk about or require procedures to back up resources or have redundancy for resources. As these terms are neither officially defined nor intuitive, this decision record tries to bring some clarity to this topic. It discusses which failure threats are cloud-service-provider-facing and classifies them into several levels.


## Decision

First there needs to be an overview about possible failure cases in infrastructures:
Contributor

Suggested change
First there needs to be an overview about possible failure cases in infrastructures:
First there needs to be an overview of possible failure cases in infrastructures as well as their probability of occurrence and the damage they may cause.


| Failure Case | Probability | Consequences |
|----|-----|----|
| Disk Failure/Loss | High | Data loss on this disk. Impact depends on type of lost data (data base, user data) |
Contributor

In favor of simplicity, I would assume that disk loss/failure causes permanent loss of the data on this disk.

Suggested change
| Disk Failure/Loss | High | Data loss on this disk. Impact depends on type of lost data (data base, user data) |
| Disk Failure/Loss | High | Permanent data loss on this disk. Impact depends on type of lost data (data base, user data) |

| Failure Case | Probability | Consequences |
|----|-----|----|
| Disk Failure/Loss | High | Data loss on this disk. Impact depends on type of lost data (data base, user data) |
| Node Outage | Medium to High | Data loss on node / (temporary) loss of functionality and connectivity of node (impact depends on type of node) |
Contributor

I prefer to distinguish between Node Failure/Loss, meaning the hardware is irrecoverably damaged, and Node Outage, caused by a power outage, as the two cases have different implications. Furthermore, we should define a node as computation hardware without disks. This makes the failure cases easier to classify.

Suggested change
| Node Outage | Medium to High | Data loss on node / (temporary) loss of functionality and connectivity of node (impact depends on type of node) |
| Node Failure/Loss (without disks) | Medium to High | Permanent loss of functionality and connectivity of node (impact depends on type of node) |
| Node Outage | Medium to High | Temporary loss of functionality and connectivity of node (impact depends on type of node) |

|----|-----|----|
| Disk Failure/Loss | High | Data loss on this disk. Impact depends on type of lost data (data base, user data) |
| Node Outage | Medium to High | Data loss on node / (temporary) loss of functionality and connectivity of node (impact depends on type of node) |
| Rack Outage | Medium | similar to Disk Failure and Node Outage |
Contributor

A rack outage means an outage of all nodes in the rack. As the disks are not damaged, I prefer to limit the consequences to:

Suggested change
| Rack Outage | Medium | similar to Disk Failure and Node Outage |
| Rack Outage | Medium | Outage of all nodes in rack |

| Disk Failure/Loss | High | Data loss on this disk. Impact depends on type of lost data (data base, user data) |
| Node Outage | Medium to High | Data loss on node / (temporary) loss of functionality and connectivity of node (impact depends on type of node) |
| Rack Outage | Medium | similar to Disk Failure and Node Outage |
| Power Outage (Data Center supply) | Medium | potential data loss, temporary loss of functionality and connectivity of node (impact depends on type of node) |
Contributor

As I said, I would omit "data loss" and focus on the major consequences. Most protocols work with acknowledgments. Hence, we can assume that data loss is temporary. What we really lose is the data held in CPU and RAM, but we should omit these consequences, as we cannot prevent or avoid them.

Suggested change
| Power Outage (Data Center supply) | Medium | potential data loss, temporary loss of functionality and connectivity of node (impact depends on type of node) |
| Power Outage (Data Center supply) | Medium | temporary outage of all nodes in rack (impact depends on type of node) |

| Storm/Tornado | Low | permanent Disk and Node loss in the affected fire zone |
| Cyber threat | High | permanent loss or compromise of data on affected Disk and Node |

These failure cases can result in temporary (T) or permanent (P) loss of the resource or data within.
Contributor

I like this table. It is a very good starting point for thinking about backup and failure strategies. IMO, its content is not in a final state, and neither are the following paragraphs, which look a bit incomplete. We should do another round of brainstorming to take OpenStack services as well as k8s resources and services into account.

In the end, I would like to see an overview of how the failure cases defined in the table above affect the availability of all OpenStack and k8s resources and services. This overview MUST include all fail-safe strategies the SCS project requires, together with the failure cases they protect against. Maybe this overview would be a separate standard or would reference other standards. We have to discuss this.
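
As a possible starting point for such an overview, here is a minimal sketch that captures the failure cases as data, using the temporary (T) / permanent (P) distinction from the draft and the rows as amended by the suggestions in this review. The class and field names are hypothetical. Rack Outage is left out because the thread does not state whether its loss counts as temporary or permanent.

```python
from dataclasses import dataclass
from enum import Enum


class Loss(Enum):
    TEMPORARY = "T"
    PERMANENT = "P"


@dataclass(frozen=True)
class FailureCase:
    name: str
    probability: str  # free text as in the table, e.g. "Medium to High"
    loss: Loss
    consequence: str


# Rows taken from the table and the suggested changes in this thread.
FAILURE_CASES = [
    FailureCase("Disk Failure/Loss", "High", Loss.PERMANENT,
                "data loss on this disk"),
    FailureCase("Node Failure/Loss (without disks)", "Medium to High", Loss.PERMANENT,
                "loss of functionality and connectivity of node"),
    FailureCase("Node Outage", "Medium to High", Loss.TEMPORARY,
                "loss of functionality and connectivity of node"),
    FailureCase("Power Outage (Data Center supply)", "Medium", Loss.TEMPORARY,
                "outage of all nodes in rack"),
    FailureCase("Storm/Tornado", "Low", Loss.PERMANENT,
                "disk and node loss in the affected fire zone"),
    FailureCase("Cyber threat", "High", Loss.PERMANENT,
                "loss or compromise of data on affected disk and node"),
]


def cases_with(loss: Loss) -> list[str]:
    """Return the names of all failure cases with the given kind of loss."""
    return [case.name for case in FAILURE_CASES if case.loss is loss]


temporary_cases = cases_with(Loss.TEMPORARY)
permanent_cases = cases_with(Loss.PERMANENT)
```

A structure like this could later be extended with the affected OpenStack and k8s resources per failure case, once those have been agreed on.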

Signed-off-by: josephineSei <128813814+josephineSei@users.noreply.github.com>

:::

| Level/Class | Probability | Failure Causes | loss in IaaS | User Hints |
Contributor Author

I am still thinking about the "User Hints" column. Putting it next to the other columns is good from some perspectives, as a row can then be read as: I want to achieve failsafe level 2, which can be triggered by these failure causes, which will result in these losses on the IaaS level, so I can do what is shown in the user hints.
But we wanted the classification mainly as a definition for standards, not as a set of examples for users, so maybe we should not reference those standards here.
We could instead use an extra table with example actions (standards, "user has to do things", ...) for each level/class, or maybe this belongs in a guide rather than in a decision record.
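
To illustrate the "extra table" option, here is a rough sketch in which the level definitions and the user hints live in two separate structures that share only the level number. All names are hypothetical, and the angle-bracket strings are placeholders, not proposed content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailsafeLevel:
    level: int
    probability: str
    failure_causes: tuple[str, ...]
    loss_in_iaas: str


# Normative part: what a standard would reference. Placeholder values only.
LEVELS = {
    1: FailsafeLevel(1, "<probability>", ("<failure cause>",), "<loss in IaaS>"),
    2: FailsafeLevel(2, "<probability>", ("<failure cause>",), "<loss in IaaS>"),
}

# Informative part: example actions, kept out of the definition itself.
USER_HINTS = {
    1: ("<example action>",),
    2: ("<example action>", "<referenced standard>"),
}


def hints_for(level: int) -> tuple[str, ...]:
    """Look up the example actions for a defined level (may be empty)."""
    if level not in LEVELS:
        raise KeyError(f"unknown failsafe level: {level}")
    return USER_HINTS.get(level, ())


level_2_hints = hints_for(2)
```

Keeping the hints in their own mapping means the definition can be finalized in the decision record while the hints move to a guide without touching the levels.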

Successfully merging this pull request may close these issues.

Taxonomy of backup / redundancy / failsafe levels
4 participants