Create v2 of node distribution standard (issues/#494) #524
Conversation
Adds the new label topology.scs.openstack.org/host-id to the standard and extends the standard to require providers to set the labels on their managed K8s clusters.

Update the label topology.scs.openstack.org/host-id in the node distribution test to topology.scs.community/host-id.
LGTM
Let Team Container in its next meeting vote on whether this needs a v2; apart from that: good to go!
As discussed in the SIG Standardization and Certification meeting, the standard is updated to v2 with the changes made to it and immediately set to "Stable", since no major changes were made. This mainly stems from the rushed job we did earlier.
This should be done now. After some more approvals fly in, I will merge this.
Just found two typos and a cosmetic issue – apart from that, it is good to go!
As discussed in the Standardization/Certification meeting, we will set this v2 to a Draft state for now.
Some small cosmetic updates fixing problems mentioned by @martinmo.
I changed the document state to Draft. I also fixed (probably) all remaining problems, so it should be mergeable now.
@mbuechse Please check this again, since this is in Draft now.
As discussed yesterday in SIG std/cert, we still want to refine a few things.
> Since some providers only have small environments to work with and therefore couldn't comply with this standard, it will be treated as a RECOMMENDED standard, where providers can OPT OUT.
I think I would drop this sentence. We should, however, provide implementation notes.
And we should clarify with the help of Team Container what really happens with Gardener (and the like): if there is a set of control-plane nodes that are being shared by the control planes of multiple clusters -- is it really the case that these nodes cannot be enumerated? Can we maybe get access to such a cluster (I think @berendt is working on that right now and we should get access soon)?
Wait, why is this sentence still in here? Have you found out in the meantime (with the help of Team Container) what the situation is like with a shared pool of control-plane nodes, and how this would present itself to the test?
> - To provide metadata about the node distribution, which also enables testing of this standard, providers MUST annotate their K8s nodes with the labels listed below. These labels MUST be kept up to date with the current state of the deployment.
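To make the testing angle concrete, here is a minimal sketch of what a check for these labels could look like, assuming the official kubernetes Python client and the label names discussed in this thread (the host-id label per this v2 draft). This is an illustration, not the actual SCS node distribution test:

```python
# Minimal sketch of a node-label check; NOT the actual SCS conformance test.
# Assumes the kubernetes Python client and a working kubeconfig.
from kubernetes import client, config

# Label names as discussed in this thread; the host-id label is the one
# introduced by this v2 draft.
REQUIRED_LABELS = [
    "topology.kubernetes.io/zone",
    "topology.kubernetes.io/region",
    "topology.scs.community/host-id",
]

def check_node_labels() -> bool:
    config.load_kube_config()
    ok = True
    for node in client.CoreV1Api().list_node().items:
        labels = node.metadata.labels or {}
        missing = [name for name in REQUIRED_LABELS if name not in labels]
        if missing:
            print(f"node {node.metadata.name} is missing: {missing}")
            ok = False
    return ok

if __name__ == "__main__":
    raise SystemExit(0 if check_node_labels() else 1)
```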
As of now, we don't have any idea when this can be implemented with CAPO. @matofeder Do you have an update?
Unless the implementation is in sight, we should demote this from a "MUST annotate" to a "SHOULD annotate". (The "MUST be kept up to date" can remain, so long as it's clear that it only applies if labels are being set at all.)
@chess-knight as you have better visibility of this topic now, please provide an update here, thx
The draft PR SovereignCloudStack/cluster-stacks#53 with hostId labeling workaround was created by @matofeder and @SovereignCloudStack/vp06c will continue that work ASAP.
Regarding upstream, there is no progress AFAIK.
I created an upstream issue kubernetes/cloud-provider-openstack#2579, so let's wait for their opinion on that.
Since this seems to be low-hanging fruit in the opinion of the upstream folks, should I/we put the code in ourselves in order to speed the process up (kubernetes/cloud-provider-openstack#2579)?
PR SovereignCloudStack/cluster-stacks#53 is ready for review and includes a workaround (creating server groups, instructing CAPO to use these server groups for nodes, labelling nodes with a custom host-id label). kubernetes/cloud-provider-openstack#2579 is about the labelling part, and @jschoone wanted to contact them. The server-group part is still missing in CAPO. I will try to comment in kubernetes-sigs/cluster-api-provider-openstack#1912 to speed it up.
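For illustration only, a rough Python sketch of the labelling part of such a workaround, assuming openstacksdk plus the kubernetes client with working credentials, and assuming node names match the Nova server names (assumptions of this sketch, not of the actual PR):

```python
# Rough sketch of the labelling idea: copy the hashed hostId that Nova
# reports for each server onto the matching K8s node.
# Assumptions: openstacksdk and the kubernetes client are configured, and
# node names equal Nova server names. Note that a one-shot job like this
# goes stale after live migration, which is exactly the concern raised below.
import openstack
from kubernetes import client, config

def label_nodes_with_host_id() -> None:
    conn = openstack.connect()  # reads clouds.yaml / OS_* env vars
    config.load_kube_config()
    v1 = client.CoreV1Api()
    # map server name -> hashed compute host id as exposed by Nova
    host_ids = {srv.name: srv.host_id for srv in conn.compute.servers()}
    for node in v1.list_node().items:
        host_id = host_ids.get(node.metadata.name)
        if host_id:
            body = {"metadata": {"labels": {"topology.scs.community/host-id": host_id}}}
            v1.patch_node(node.metadata.name, body)

if __name__ == "__main__":
    label_nodes_with_host_id()
```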
Also, I am unsure whether it is low-hanging fruit: as I wrote in kubernetes/cloud-provider-openstack#2579, live migration can be problematic because the label will not be updated. IMO, it needs discussion.
And what about bare-metal clusters? In that case, we do not need this specific label, right?
Did some more textual updates requested by @mbuechse.
Fine apart from the change that I indicated.
> Since some providers only have small environments to work with and therefore couldn't comply with this standard, it will be treated as a RECOMMENDED standard, where providers can OPT OUT.
Wait, why is this sentence still in here? Have you found out in the meantime (with the help of Team Container) what the situation is like with a shared pool of control-plane nodes, and how this would present itself to the test?
No, I haven't (because I had two other meetings conflicting with the last container meeting). My problem with cutting this part would still be that smaller CSPs can't necessarily provide a setup like this, especially for all their clusters. And that probably isn't necessary either.
@mbuechse Please check this again and resolve the parts that are fine with you. I'm gonna coordinate the upstream work with @chess-knight now.
@cah-hbaum Looking good, thanks for all the work! Now, only the matter of the "optionality" of this standard needs some discussion IMO. The certificate scope YAML permits marking a standard optional, but then I don't know what good the standard is, because most people, when they look at the certificate, will never notice whether this particular standard was satisfied or not. And I still don't understand why this standard is so hard to satisfy after all, even for small providers. If it's indeed possible for multiple K8s clusters to share the same hosts for their control planes, then each cluster can have three different hosts while all clusters in total still only need three hosts, so where's the problem?
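A toy illustration of that argument (all names hypothetical): three physical hosts can carry the control planes of arbitrarily many clusters, yet every individual cluster still sees its control-plane nodes spread over three distinct hosts.

```python
# Toy model: shared control-plane hosts. Three hosts serve many clusters,
# yet each single cluster spans three distinct hosts.
HOSTS = ["host-a", "host-b", "host-c"]

def place_control_plane(cluster: str) -> dict[str, str]:
    # one control-plane node per physical host, hosts shared across clusters
    return {f"{cluster}-cp-{i}": host for i, host in enumerate(HOSTS)}

clusters = {name: place_control_plane(name) for name in ("c1", "c2", "c3", "c4")}
for nodes in clusters.values():
    assert len(set(nodes.values())) == 3  # every cluster is spread over 3 hosts

total_hosts = {host for nodes in clusters.values() for host in nodes.values()}
print(f"clusters: {len(clusters)}, hosts used in total: {len(total_hosts)}")  # 4 clusters, 3 hosts
```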
I'm gonna update the standard to make it NOT optional and take it to a vote or request in the "SCS Container" Room.
Posted into the "SCS Container" Matrix room, just putting the whole thing in here as a paraphrase to make it persistent:
Another thing we maybe need to discuss came up while working on #556. I installed a K8s cluster through yaook/k8s. In my case, is the cluster really "not distributed" just because the labels aren't there? The values were even available, just under different labels (the default topology.kubernetes.io labels).
IMHO it is OK to require conformant implementations to set some additional labels, even if some work needs to be done initially to support it (e.g., in yaook/k8s). Standardized labels are not only useful for the conformance test: another important use case is that customers can use these labels to distribute their own applications over failure zones using Pod topology spread constraints (see the sketch at the end of this thread). Also, in #524 (comment), an important question about bare-metal clusters came up.
Theoretically, in a bare-metal cluster, the host-id label could be set anyway (it would simply identify the node itself), so consumers wouldn't have to treat bare metal specially.
Yes, that would be desirable from my POV. People shouldn't be expected to check whether the cluster is running on bare metal or not.
I double checked this: yaook/k8s connects to the OpenStack layer via the OpenStack cloud controller manager (CCM), which sets the topology.kubernetes.io/zone and topology.kubernetes.io/region labels.
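To round off the customer-facing use case mentioned above: a hedged sketch of how a customer could use the standardized host-id label as the topologyKey of a Pod topology spread constraint, built here with the kubernetes Python client (the "web" app label and nginx image are made-up examples):

```python
# Hedged sketch of the customer use case: spread an application over
# distinct physical hosts via the standardized host-id label.
from kubernetes import client

# spread pods of app=web so that no host carries more than one extra replica
spread = client.V1TopologySpreadConstraint(
    max_skew=1,
    topology_key="topology.scs.community/host-id",
    when_unsatisfiable="DoNotSchedule",
    label_selector=client.V1LabelSelector(match_labels={"app": "web"}),
)

pod = client.V1Pod(
    api_version="v1",
    kind="Pod",
    metadata=client.V1ObjectMeta(name="web-0", labels={"app": "web"}),
    spec=client.V1PodSpec(
        containers=[client.V1Container(name="web", image="nginx")],
        topology_spread_constraints=[spread],
    ),
)
```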