
One bad CSI volume can stop all volumes from being scheduled #3120

Open
s4ke opened this issue Feb 14, 2023 · 11 comments


s4ke commented Feb 14, 2023

While implementing CSI support for Hetzner Cloud, we ran into some strange behaviour where one "broken" volume can prevent all other volumes from being scheduled properly.

In the case of the Hetzner plugin, this was caused by volumes created in hel1 not being schedulable in nbg1, which is expected. But because swarmkit also tried to publish that volume to a node in nbg1 (how would we prevent this? do/should cluster volumes support placement constraints?), publishing could never succeed on those nodes. Another volume created in nbg1, which should have succeeded, was then blocked from succeeding by the earlier failure.

Only after force-removing all volumes and services and starting from scratch (this time making sure to only schedule things in nbg1) were we able to properly create a volume.

see hetznercloud/csi-driver#376 (comment) for further details
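For context, the location a node advertises comes from the CSI plugin's NodeGetInfo response, and that AccessibleTopology is what the swarm scheduler is supposed to use to keep a hel1 volume off nbg1 nodes. Below is a minimal sketch of how a node plugin reports it, using the CSI spec's Go bindings; the package name, struct, and segment key are illustrative placeholders, not the Hetzner driver's actual code.

```go
// Minimal illustrative sketch of a CSI node plugin reporting its accessible
// topology via NodeGetInfo, using the CSI spec Go bindings. The segment key
// "topology.example.com/location" is a placeholder, not the Hetzner driver's
// real key.
package csiexample

import (
	"context"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
)

type nodeServer struct {
	nodeID   string
	location string // e.g. "nbg1" or "hel1"
}

// NodeGetInfo tells the CO (here: swarm) which topology segments this node
// belongs to. The scheduler should only place a volume on nodes whose
// segments satisfy the volume's accessibility requirements.
func (s *nodeServer) NodeGetInfo(ctx context.Context, req *csi.NodeGetInfoRequest) (*csi.NodeGetInfoResponse, error) {
	return &csi.NodeGetInfoResponse{
		NodeId: s.nodeID,
		AccessibleTopology: &csi.Topology{
			Segments: map[string]string{
				"topology.example.com/location": s.location,
			},
		},
	}, nil
}
```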


s4ke commented Feb 14, 2023

@neersighted should we create another issue for cluster volumes not having support for placement constraints?


s4ke commented May 16, 2023

@dperny Can you take a look here? The linked comment on the hetznercloud csi-driver issue explains the problem quite well. I would love to help in any way I can.

@s4ke s4ke changed the title One bad CSI volume can stop the all volumes from being scheduled One bad CSI volume can stop all volumes from being scheduled May 20, 2023

dperny commented Jun 20, 2023

I'm taking a look at this issue now.


dperny commented Jun 21, 2023

@s4ke I believe there are several issues afoot here:

  1. There is some error causing volumes to be scheduled to invalid nodes.
  2. There is some error whose failures lock up the volume management component.
  3. There is some other error that may cause nodes to incorrectly report volume status to the managers.

It's all quite nasty.


s4ke commented Jun 21, 2023

@dperny Okay, thanks for taking a look. Were you able to reproduce this issue? Do you think this is something at the driver level or in swarmkit?


dperny commented Jun 21, 2023

So, the open questions I have right now about the linked issue:

The Volume is getting scheduled to a node outside of its availability constraint, which is odd. However, the CSIInfo field shows only PluginName, NodeID, and MaxVolumesPerNode; it does not seem to show AccessibleTopology. CSI volumes are scheduled based on the AccessibleTopology reported by the Node (which it gets from the plugin), not by Labels on the Node (which are used for regular placement constraints).

But that said, even a blank AccessibleTopology should not result in a decision to schedule the Volume to the Node. I've checked the function myself and even written a test: it should not be scheduled. So the question remains, why?
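For reference, that check is essentially topology containment: a node is a candidate only if its reported segments satisfy one of the volume's requisite topologies. A hypothetical sketch of the logic (not swarmkit's actual function, just the shape of it):

```go
package csiexample

// topologySatisfies reports whether a node's accessible-topology segments
// satisfy at least one of the volume's requisite topologies. Hypothetical
// sketch for illustration, not swarmkit's implementation.
func topologySatisfies(requisite []map[string]string, nodeSegments map[string]string) bool {
	if len(requisite) == 0 {
		// No topology requirement: any node is acceptable.
		return true
	}
	for _, topo := range requisite {
		matches := true
		for key, value := range topo {
			if nodeSegments[key] != value {
				matches = false
				break
			}
		}
		if matches {
			return true
		}
	}
	return false
}
```

With a blank node topology and a non-empty requisite list this returns false, which matches the expectation that a blank AccessibleTopology should not get the Volume scheduled.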

Further, there is an issue I know of that I believe is related, in which a Volume object loses the PublishStatus for all nodes and ends up back in Created status. So I know there is an issue, somewhere, with the Volume PublishStatus being set incorrectly. The question there is also: why?

Next, why is the Volume ID not included in the ControllerPublishVolume request, as the logs seem to indicate? That shouldn't be possible; the Volume is not considered Created until it has its ID.
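For reference, volume_id is a REQUIRED field on ControllerPublishVolume in the CSI spec, so an empty ID in that request implies the Volume reached the publish path before its ID was recorded. A hedged sketch of the request shape using the spec's Go bindings; the helper name and parameters are made up for illustration, this is not swarmkit's code:

```go
package csiexample

import (
	"context"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
)

// publishVolume is a hypothetical helper showing the shape of the
// ControllerPublishVolume RPC defined by the CSI spec. Per the spec, VolumeId
// and NodeId are REQUIRED; an empty VolumeId (as the logs suggest) should
// never be sent once a Volume has reached the Created state.
func publishVolume(ctx context.Context, client csi.ControllerClient, volumeID, csiNodeID string) error {
	_, err := client.ControllerPublishVolume(ctx, &csi.ControllerPublishVolumeRequest{
		VolumeId: volumeID,  // the ID returned by CreateVolume
		NodeId:   csiNodeID, // the plugin-reported node ID, not the swarm node ID
		VolumeCapability: &csi.VolumeCapability{
			AccessType: &csi.VolumeCapability_Mount{
				Mount: &csi.VolumeCapability_MountVolume{},
			},
			AccessMode: &csi.VolumeCapability_AccessMode{
				Mode: csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER,
			},
		},
	})
	return err
}
```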

There's a rat's nest of issues here that I suspect all stem from a small set of root causes. Whoever wrote this CSI code is a doofus.


s4ke commented Jun 21, 2023

I will see what I can do to help answer your questions, and when I can get to it.


s4ke commented Jun 21, 2023

Could this be related? moby/moby#45547


dperny commented Jun 22, 2023

That's exactly the issue I had in mind.


dperny commented Jun 22, 2023

OK, for starters, I have figured out one problem.

This is where we convert the gRPC response into Docker API objects:

https://github.com/moby/moby/blob/b3843992fc12536908fea2fea3ece05725b1e613/daemon/cluster/convert/node.go#L59-L70

And this is the Docker API object in question:

https://github.com/moby/moby/blob/b3843992fc12536908fea2fea3ece05725b1e613/api/types/swarm/node.go#L72-L85

It seems I forgot something critical in the conversion: AccessibleTopology is being dropped, which makes debugging this issue difficult. The Scheduler takes into account the AccessibleTopology of the Node, as reported by the CSI plugin, when making its scheduling decisions.

Without knowing what AccessibleTopology is being reported, I cannot tell whether the Scheduler is making an error or is suffering from garbage-in, garbage-out.
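To make the missing piece concrete, here is a sketch of a conversion that carries AccessibleTopology across instead of dropping it. The types below are illustrative stand-ins (only PluginName, NodeID, and MaxVolumesPerNode are named above); this shows the shape of the fix, not the actual moby patch.

```go
package csiexample

// Illustrative stand-ins for the API types involved; the real definitions
// live in moby's api/types/swarm and in swarmkit's api package.
type Topology struct {
	Segments map[string]string
}

type NodeCSIInfo struct {
	PluginName         string
	NodeID             string
	MaxVolumesPerNode  int64
	AccessibleTopology *Topology
}

// convertCSIInfo sketches the fix: copy the plugin-reported topology segments
// into the API object instead of silently dropping them, so that inspecting a
// node shows which topology it actually reported.
func convertCSIInfo(pluginName, nodeID string, maxVolumes int64, segments map[string]string) NodeCSIInfo {
	info := NodeCSIInfo{
		PluginName:        pluginName,
		NodeID:            nodeID,
		MaxVolumesPerNode: maxVolumes,
	}
	if len(segments) != 0 {
		info.AccessibleTopology = &Topology{Segments: segments}
	}
	return info
}
```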


s4ke commented Sep 26, 2023

It's been a while since I was in this discussion, and I have honestly lost track of where we are with it. How can I help? Are the questions above still open?
