Taxonomy of backup / redundancy / failsafe levels #527

Open · Tracked by #285
mbuechse opened this issue Mar 18, 2024 · 15 comments · May be fixed by #579

Comments

@mbuechse
Contributor

Do we have a taxonomy of failsafe levels?

For instance, the emerging standard on volume types refers to replication, and in this case, it is mostly to protect against a failure of a storage device, so we specify neither the number of replicas nor whether they should span multiple zones etc. In other cases, however, we might want to protect against power loss or fire or other risks.

So it would be interesting to define multiple levels of "failsafe" that may be applied to replication, backups and the like, and to establish handy nomenclature.

@mbuechse added the question, needs refinement, SCS is standardized, and SCS-VP10 labels Mar 18, 2024
@mbuechse added this to the R7 (v8.0.0) milestone Mar 18, 2024
@mbuechse self-assigned this Mar 18, 2024
@josephineSei
Contributor

josephineSei commented Mar 26, 2024

Use Cases for Redundancy and their probability

| Use Case | Probability | Consequences |
|---|---|---|
| Disk Failure/Loss | High | Data loss on this disk; impact depends on the type of lost data (database, user data) |
| Node Outage | Medium to High | Data loss on the node / (temporary) loss of functionality and connectivity of the node (impact depends on the type of node) |
| Rack Outage | Medium | Similar to disk failure and node outage |
| Power Outage (data center supply) | Medium | Potential data loss, temporary loss of functionality and connectivity of the node (impact depends on the type of node) |
| Fire | Medium | Permanent disk and node loss in the affected zone |
| Flood | Low | Permanent disk and node loss in the affected zone |
| Earthquake | Very Low | Permanent disk and node loss in the affected zone |
| Storm/Tornado | Low | Permanent disk and node loss in the affected fire zone |
| Cyber threat | High | Permanent loss of data on the affected disks and nodes |

Grouping those use cases

| Group | Level of affection | Use Cases |
|---|---|---|
| Level 1 | Single volumes, VMs, ... | Disk failure, node outage, (maybe rack outage) |
| Level 2 | A number of resources, most of the time recoverable | Rack outage, (fire), (power outage when different power supplies exist) |
| Level 3 | Lots of resources / user data, potentially not recoverable | Fire, earthquake, storm/tornado, power outage |
| Level 4 | Complete deployment, not recoverable | Flood, fire |

Redundancy in OpenStack Resources

Replication of Volumes

Replication of volumes can be achieved either by using a backend that already provides replication, OR it can be defined in a volume type that then uses two different backends (or backend instances).

The replication provided this way can range from data merely being replicated onto a different disk/SSD in the same rack, over mirroring to another rack, up to being mirrored into a different fire zone, depending on the physical infrastructure.

Nevertheless, volume replication should be seen as the simplest kind of redundancy (Level 1): the one that protects against simple hardware failures such as a disk or node failure.
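As an illustration of the volume-type route, here is a minimal, hedged CLI sketch. The extra-spec keys follow the common Cinder replication convention, but the backend name is a made-up example and the exact values depend on the storage driver in use:

```shell
# create a volume type whose backend is expected to replicate data
# (backend name "rbd-replicated" is a hypothetical example)
openstack volume type create volumes-replicated
openstack volume type set \
  --property volume_backend_name='rbd-replicated' \
  --property replication_enabled='<is> True' \
  volumes-replicated
```

Users would then simply request this type when creating a volume, e.g. `openstack volume create --type volumes-replicated --size 20 my-volume`.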

Replication of Objects (Object Storage)

Nowadays most installations do not use Swift directly, which means that dealing with the Swift-internal replication (https://docs.openstack.org/swift/latest/overview_replication.html) is not needed.

Instead, many deployments use RadosGW with a Ceph backend. Here the internal Ceph replication can be used.

This means replication is mainly used as the simplest kind of redundancy (Level 1). Due to the nature of objects, however, they can easily be replicated and stored by the users themselves. A user can thus store their object data in different locations, which would result in redundancy of Level 4.
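To make the two levels concrete, a hedged sketch for a Ceph/RadosGW setup: Level 1 comes from the pool's replication factor on the operator side, while Level 4 is something the user can arrange by mirroring objects to a second, geographically separate object store. Pool and remote names below are typical examples, not prescriptions:

```shell
# operator side: check/raise the replication factor of the RGW data pool
ceph osd pool get default.rgw.buckets.data size
ceph osd pool set default.rgw.buckets.data size 3

# user side: mirror a bucket to an object store in another location,
# e.g. with rclone (remote names are placeholders)
rclone sync primary-site:my-bucket secondary-site:my-bucket
```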

Replication of server data (VMs)

Due to the nature of data in use and its constant change, it is not easy to provide replication on the IaaS layer.

There are two different kinds of VMs that need to be considered (a minimal CLI sketch of both follows below):

  1. Ephemeral-storage VMs: the ephemeral storage is stored directly on the compute node. This, combined with the data being in use and constantly updated, makes replication impossible. Everything that should be redundant MUST be handled by the IaaS user.

  2. Volume-based VMs: these make use of the possible replication of the volumes. As soon as new data is written or updated, the blocks are automatically replicated by the volume storage solution, so replication of Level 1 would be given. BUT: because the data is in use and blocks rather than files are written, the consistency of the data is not always given, and there will be a "short" delay while data is transferred from the compute node to the storage backend.
     In this case users will always need to check for consistency themselves and be aware that the data may be slightly outdated (up to a few minutes). In case of a node outage (compute host) there is a good chance of being able to reconstruct user data from just a few minutes ago.
     For any higher level of redundancy, users SHOULD employ redundancy mechanisms on layers above IaaS.
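As mentioned above, a minimal sketch of how the two kinds of VMs are created (image, flavor, and resource names are placeholders; the volume type reuses the hypothetical replicated type from the earlier example):

```shell
# ephemeral-storage VM: the root disk lives on the compute node and is lost with it
openstack server create --image debian-12 --flavor m1.small vm-ephemeral

# volume-based VM: create a replicated boot volume first, then boot from it;
# the root disk inherits whatever replication the chosen volume type provides
openstack volume create --image debian-12 --size 20 --type volumes-replicated boot-vol
openstack server create --flavor m1.small --volume boot-vol vm-volume-based
```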

Replication of Secrets

The keys used for volume encryption and stored in Barbican are necessary to decrypt user data. The keys are stored encrypted in a simple database: each key is encrypted with a project key encryption key (project KEK), and that project KEK is in turn stored encrypted with a master KEK. The project KEKs are stored either in the database or in an HSM, and where the master KEK is stored depends on the Barbican plugin in use.

The database is always deployed redundantly on different nodes and would survive a Level 1 and maybe a Level 2 failure. Barbican itself should always be deployed redundantly too, which leaves the master KEK as the only potential single point of failure. Whether it is one depends on where it is stored: kept within each Barbican instance or within a networked HSM cluster, it would be safe against Level 1 and maybe Level 2 failures; stored in a single HSM, a failure of that HSM would render all encrypted data (that is not currently in use) impossible to access.

The CSP could back up the master KEK either through the life-cycle tool or through a dedicated backup, which would help here.
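To make the single point of failure concrete, a hedged sketch of where the master KEK lives when Barbican's default simple_crypto plugin is used (section and option names as commonly documented for that plugin; adjust to the plugin actually deployed, the key value is a placeholder):

```ini
# /etc/barbican/barbican.conf (simple_crypto plugin, assumed setup)
[simple_crypto_plugin]
# the master KEK; if this value is lost, all project KEKs and therefore all
# stored secrets (e.g. volume encryption keys) become unrecoverable, so it
# should be part of the CSP's (encrypted) configuration backup
kek = '<base64-encoded 32-byte key>'
```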

@josephineSei
Contributor

@mbuechse I can go on and cover the other OpenStack resources or you can.

@mbuechse
Contributor Author

@josephineSei Feel free to cover more. I mainly created the issue so we don't forget about the topic. I wanted to bring it up in the IaaS call to collect opinions, but probably not before next week. If you want, you can bring it up even this week.

@anjastrunk
Contributor

IMO: This taxonomy influences several standards in IaaS and KaaS, such as availability zones and regions in OpenStack or base security features in Kubernetes. We should create a task force consisting of all stakeholders.

@anjastrunk
Contributor

@josephineSei Is there any state-of-the-art taxonomy?

@anjastrunk
Contributor

anjastrunk commented Apr 25, 2024

Outcome from brainstorming meeting 25.04.24:

  • We should distinguish between temporary loss and permanent loss
  • Cloud resources (IaaS):
    • user data on volumes
    • user data on images
    • user data on RAM/CPU
    • volume-based VMs
    • ephemeral-storage-based VMs
    • secrets in data base
    • network configuration data (router, ports, security groups, ...) in data base
    • network connectivity (infrastructure materialized from network configuration data in data base)
    • floating IPs

Challenges regarding network configuration:

  • Assume the following use case: a volume-based VM is running on a host which will be powered off. Restart of the VM on a new host and re-building of the network connectivity will not take place automatically; it must be triggered manually (see the sketch below).
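For illustration, a hedged sketch of what that manual trigger could look like (host and server names are placeholders; newer python-openstackclient releases also offer `openstack server evacuate`):

```shell
# mark the powered-off compute host as disabled so the scheduler avoids it
openstack compute service set --disable failed-host.example.com nova-compute

# rebuild the volume-based VM on another hypervisor; the server keeps its
# ports and addresses, but the operator has to trigger this explicitly
nova evacuate vm-volume-based new-host.example.com
```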

@josephineSei @markus-hentsch: Write DR for taxonomy.

@anjastrunk assigned josephineSei and unassigned mbuechse Apr 25, 2024
@horazont
Member

> Volume-based VMs: these make use of the possible replication of the volumes. As soon as new data is written or updated, the blocks are automatically replicated by the volume storage solution, so replication of Level 1 would be given. BUT: because the data is in use and blocks rather than files are written, the consistency of the data is not always given, and there will be a "short" delay while data is transferred from the compute node to the storage backend.
>
> In this case users will always need to check for consistency themselves and be aware that the data may be slightly outdated (up to a few minutes). In case of a node outage (compute host) there is a good chance of being able to reconstruct user data from just a few minutes ago.

I don't think that is completely accurate. Volume backends are expected not to lie about data persistence to applications, which means that data written to volumes is as consistent as it would be after a power outage of a (cacheless) disk. That is something all resilient applications (such as databases and filesystems) are developed against (as a "threat" model), so it is a scenario which should be handled reasonably well.

In other words, if your SQL database server (such as PostgreSQL) returns successfully from a COMMIT, you can expect, even on volume storage, that if you pull the plug from the hypervisor and floor(N/2) of the Ceph replicas immediately afterwards and spin it up on a different hypervisor with the same volume, the committed data is in fact available.¹


As for the taxonomy, I'm not aware of anything, but I'll ask around some more. I mostly deal with the development side of things in certifications (such as ISO27001) and less with operations.

Footnotes

  1. I think there are in fact cache modes one can set in OpenStack which break this guarantee. Don't do that.
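As a hedged example of the kind of setting the footnote warns about: Nova's libvirt `disk_cachemodes` option can be set to a mode that ignores guest flushes and thereby voids the guarantee described above (illustration only, not a recommendation):

```ini
# /etc/nova/nova.conf on the compute node
[libvirt]
# "unsafe" ignores flush requests from the guest; a power loss can then
# discard data the application believed to be committed; avoid this mode
disk_cachemodes = "file=unsafe"
```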

@josephineSei
Contributor

I had a look through the ISO27001 norm, but I did not find anything specific for the case we want to define. So I will begin with the work on the DR.

@josephineSei linked a pull request Apr 26, 2024 that will close this issue
@markus-hentsch
Contributor

> I had a look through the ISO27001 norm, but I did not find anything specific for the case we want to define. So I will begin with the work on the DR.

I also had a look around the net and was astonished to find almost nothing in terms of a standardized classification scheme in the context of data center infrastructure risks (as illustrated in #527 (comment)).
The Data centre tier list often comes up, but that is rather the resulting classification of how well a data center manages the risks, not a classification of the risks and their implications themselves, as far as I understand it.

A lot of results are just disguised ads for consultancy services, and the few concrete documents that turn up often focus solely on security risks (example). It seems we indeed need to go forward with our own classification for now, until we discover something suitable.

@josephineSei
Contributor

As @horazont suggested, I looked into some guidelines of the BSI and found this:
https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/RZ-Sicherheit/RZ-Verfuegbarkeitsmassnahmen.pdf?__blob=publicationFile&v=9

Within that document, however, the BSI mostly discusses what we would classify as Level 3 or 4. There is a lot of detailed description about power supply, cooling, and protection against natural catastrophes and human threats (i.e. cyber attacks).

We (describing things from the point of view of user workloads) may have a broader view and would need a different approach than these documents describe to classify levels of IaaS failure safety.

@garloff
Contributor

garloff commented May 2, 2024

After disks and memory DIMMs, I have seen switch failures as a somewhat common failure case.

@josephineSei
Contributor

I am reading through some more documents of the BSI:

  1. This document describes in detail what redundancy can be and which forms exist. It also describes other measures like separation, etc., but no classes of failure safety are described.
  2. This document describes in detail what risks exist, which resources they affect and what their probability is. This is in detail what we have done in the first table, but it does not offer consolidated classes (which may not be possible, because the BSI is targeting all kinds of IT infrastructure, not just IaaS and PaaS/CaaS).
  3. This document also describes risks in detail.

What do we want to achieve:
We want to easily show everyone reading a standard that there are different classes of redundancy and to what extent these classes provide safety against failure. It should be clear (without explicitly writing it down every time) that a replicated volume will provide safety against a small hardware failure, but not against a natural catastrophe that destroys the whole data center.
In this way it should be clear to users who want to protect themselves against a Level 4 failure that there is no CSP that can do anything about it (unless replication in another geographically distant data center is possible). So it will always be up to the users to protect themselves against such failure cases.

@josephineSei
Contributor

I propose to use either the slot of the Standardization SIG in a week when it does not take place (this week or on 06.06.) or to use next Tuesday. I wrote to Kurt to have it posted to the ML.

@josephineSei
Contributor

The session for discussion will take place on 23.05. I wrote a mail to the ML.

@markus-hentsch
Contributor

> The session for discussion will take place on 23.05. I wrote a mail to the ML.

During the breakout session we started by discussing what the main purpose of the taxonomy standard should be, in order to shape the discussion and further research on this topic accordingly.

Possible purposes that we discussed:

  1. an accompanying document for standards; it can be used as reasoning for decisions in other standards, as well as a classification reference mentioned in individual standards
  2. documentation for users, clarifying which risks are addressed within a SCS-compliant cloud
    • "how does the SCS project attempt to address risk X?"
  3. documentation for CSPs
    • "which risk protection do I achieve as a CSP when I apply all SCS standards?"
  4. guidelines/documentation

Discussion result:

  • primarily 1
  • also 2 and 3, but not explicitly as part of the taxonomy standard; rather implicitly, resulting from references within individual standards to the classifications defined in the taxonomy standard
  • 4 is mostly out of scope for SCS; we could only give small hints here
