Duplicated Capacity Controller Workqueue Entries #1161
Comments
cc @pohly
Also, the current log shows segments using their hex pointer values, which is not useful for determining which segment a given line refers to.
The problem that I had to solve here is that the content of a segment is not directly usable as a map key; the topology tracker is therefore meant to hand out exactly one `*Segment` instance per distinct segment, so that comparing pointers is equivalent to comparing content. Perhaps that second part is going wrong somehow?
Could it also work by using the string value of the segment? (See external-provisioner/pkg/capacity/topology/topology.go, lines 38 to 40 at e7c0bcc.)
We can try to implement another String function that prints the segment's actual content.
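A minimal sketch of what such a method could look like, assuming `Segment` is a slice of key/value entries as in topology.go (the type and field names below are simplified stand-ins, not the actual project code):

```go
package main

import (
	"fmt"
	"strings"
)

// Simplified stand-ins for the types in pkg/capacity/topology/topology.go.
type SegmentEntry struct{ Key, Value string }
type Segment []SegmentEntry

// String prints the segment's key/value pairs instead of a hex pointer,
// so log lines identify which segment they refer to.
func (s Segment) String() string {
	parts := make([]string, 0, len(s))
	for _, e := range s {
		parts = append(parts, e.Key+"="+e.Value)
	}
	return strings.Join(parts, ", ")
}

func main() {
	s := Segment{{"topology.kubernetes.io/zone", "zone-a"}}
	fmt.Printf("%p\n", &s) // pointer form: a hex address like 0xc000010030
	fmt.Println(s)         // String form: topology.kubernetes.io/zone=zone-a
}
```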
I see two places that call it. The segments are all returned from external-provisioner/pkg/capacity/capacity.go, lines 473 to 476 at e7c0bcc.
Why should this not work? Two struct values are equal if all of their fields are equal. `workItem` contains two comparable fields, so it should be comparable: external-provisioner/pkg/capacity/capacity.go, lines 102 to 105 at e7c0bcc.
Yes, that would also be possible. But I expect the string handling to be more expensive than the current pointer comparison.
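For illustration, a sketch of the string-keyed alternative being discussed, with simplified stand-in types (not the actual controller code). Keying the map by the segment's string form makes lookups compare content, at the cost of building and hashing a string on every access, which is the overhead referred to above:

```go
package main

import "fmt"

// Simplified stand-ins; names are illustrative, not the project's code.
type SegmentEntry struct{ Key, Value string }
type Segment []SegmentEntry

func (s Segment) String() string {
	out := ""
	for i, e := range s {
		if i > 0 {
			out += ","
		}
		out += e.Key + "=" + e.Value
	}
	return out
}

// The segment field holds Segment.String() rather than *Segment, so two
// keys built from equal segments match even if the pointers differ.
type workItem struct {
	segment          string
	storageClassName string
}

func main() {
	capacities := map[workItem]bool{}
	a := Segment{{"topology.kubernetes.io/zone", "zone-a"}}
	b := Segment{{"topology.kubernetes.io/zone", "zone-a"}} // equal content, separate allocation

	capacities[workItem{a.String(), "fast-disks"}] = true
	_, found := capacities[workItem{b.String(), "fast-disks"}]
	fmt.Println(found) // true: string keys compare by content
}
```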
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
What happened:

We have a CSI plugin that creates local-disk-backed PVCs for pods. In a cluster with 1693 nodes running the CSI plugin, there are `CSIStorageCapacity` objects with duplicated `nodeTopology` x `storageClassName` combinations. The duplicated `CSIStorageCapacity` objects are likely due to duplicated workqueue items in the capacity controller, leading to a huge workqueue backlog.

Our cluster currently has 1693 kubelet nodes, so there should be 2 * 1693 = 3386 unique `CSIStorageCapacity` objects (2 storage classes for each node). However, the workqueue depth of the capacity controller shows the `csistoragecapacity` depth between 5k and 6k, and `csistoragecapacities_desired_goal` is around 6.18k.

There are also inconsistencies in how the controller determines the desired `CSIStorageCapacity` object count. An earlier instance of the external-provisioner pod in the same cluster, with effectively the same node count (+/- 1 node), did calculate the correct number of `CSIStorageCapacity` objects, and its workqueue depth looked much better.

Theory for the root cause:
I believe the capacity controller is currently tracking duplicated workqueue entries. This causes delayed `CSIStorageCapacity` creation, because the controller has to work through a huge workqueue backlog.

`workItem` is the key of the map `capacities map[workItem]*storagev1.CSIStorageCapacity`, and the field `workItem.segment` is a pointer, `*topology.Segment`.
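For reference, the shape of those declarations, paraphrased with simplified stand-in types (the real definitions live in pkg/capacity/capacity.go and pkg/capacity/topology/topology.go):

```go
package capacity

// Simplified stand-ins for topology.Segment and storagev1.CSIStorageCapacity.
type SegmentEntry struct{ Key, Value string }
type Segment []SegmentEntry
type CSIStorageCapacity struct{ /* fields omitted */ }

// workItem mirrors the map key discussed above: segment is a pointer,
// so map lookups compare addresses rather than segment content.
type workItem struct {
	segment          *Segment
	storageClassName string
}

var capacities map[workItem]*CSIStorageCapacity
```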
When checking whether a key exists here (https://github.com/kubernetes-csi/external-provisioner/blob/v3.6.2/pkg/capacity/capacity.go#L473), even if `item.segment` has the same value as the supposed `workItem` key in `c.capacities`, `found` will almost always be false, because `c.capacities[item]` compares the `*topology.Segment` pointer, not the literal values within `topology.Segment`. I've also tested this in a similar setup (https://go.dev/play/p/bVwW4eb9n5A) to validate the theory.
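A minimal standalone reproduction in the spirit of that playground snippet, using simplified stand-in types rather than the actual controller code:

```go
package main

import "fmt"

// Simplified stand-ins for the real types.
type SegmentEntry struct{ Key, Value string }
type Segment []SegmentEntry

type workItem struct {
	segment          *Segment
	storageClassName string
}

func main() {
	capacities := map[workItem]bool{}

	s1 := Segment{{"topology.kubernetes.io/zone", "zone-a"}}
	s2 := Segment{{"topology.kubernetes.io/zone", "zone-a"}} // identical content, different allocation

	capacities[workItem{&s1, "fast-disks"}] = true

	// The map lookup compares the segment pointer, not its content,
	// so an item built from a different *Segment is never found.
	_, found := capacities[workItem{&s2, "fast-disks"}]
	fmt.Println(found) // Output: false
}
```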
This leads to duplicated work items being tracked by the controller, and it will poll the duplicated items in https://github.com/kubernetes-csi/external-provisioner/blob/v3.6.2/pkg/capacity/capacity.go#L517-L520.
What you expected to happen:
No duplicated workqueue entries for `CSIStorageCapacity` objects.

How to reproduce it:
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`): v1.28.5
- Kernel (e.g. `uname -a`):