Duplicated Capacity Controller Workqueue Entries #1161

Open
yuxiang-he opened this issue Feb 22, 2024 · 7 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@yuxiang-he

yuxiang-he commented Feb 22, 2024

What happened:
We have a CSI plugin that provisions local-disk-backed PVCs for pods.

In a cluster with 1693 nodes running the CSI plugin, there are CSIStorageCapacity objects with duplicated nodeTopology × storageClassName combinations.

In addition, the duplicated CSIStorageCapacity objects are likely caused by duplicated workqueue items in the capacity controller, which leads to a huge workqueue backlog.

Our cluster currently has 1693 kubelet nodes, so there should be 2 * 1693 = 3386 unique CSIStorageCapacity objects (2 storage classes per node).

This is the workqueue depth of the capacity controller; the csistoragecapacity depth sits between 5k and 6k:
[screenshot: csistoragecapacity workqueue depth graph]

In addition, csistoragecapacities_desired_goal is around 6.18k
[screenshot: csistoragecapacities_desired_goal graph]

There are also inconsistencies in how the controller determines the desired CSIStorageCapacity object count. An earlier instance of the external-provisioner pod in the same cluster, with effectively the same node count (+/- 1 node), did calculate the correct number of CSIStorageCapacity objects, and the workqueue depth looked much better:

[screenshots: desired goal and workqueue depth graphs from the earlier provisioner instance]

Theory for the root cause:
I believe the capacity controller is tracking duplicated workqueue entries. This delays CSIStorageCapacity creation because the controller has to work through a huge workqueue backlog.

workItem is the key type of the map capacities map[workItem]*storagev1.CSIStorageCapacity. The field workItem.segment is a pointer, *topology.Segment.

When checking whether a key exists here https://github.com/kubernetes-csi/external-provisioner/blob/v3.6.2/pkg/capacity/capacity.go#L473, found will almost always be false even if item.segment has the same value as an existing workItem key in c.capacities, because the map lookup compares the *topology.Segment pointer value, not the values within topology.Segment. I've validated this theory with a similar setup at https://go.dev/play/p/bVwW4eb9n5A:

package main

import "fmt"

type Foo struct {
	foo string
}

type PointerKey struct {
	foo *Foo
}

type NonPointerKey struct {
	foo Foo
}

func main() {
	pointerKeyMap := map[PointerKey]string{}
	pointerKeyMap[PointerKey{&Foo{foo: "1"}}] = "1"
	fmt.Printf("pointerKeyMap: %+v\n", pointerKeyMap)
	k := PointerKey{&Foo{foo: "1"}}
	_, ok := pointerKeyMap[k]
	fmt.Printf("key exists for %+v: %v\n", k, ok)

	nonPointerKeyMap := map[NonPointerKey]string{}
	nonPointerKeyMap[NonPointerKey{Foo{foo: "2"}}] = "2"
	fmt.Printf("nonPointerKeyMap: %+v\n", nonPointerKeyMap)
	k2 := NonPointerKey{Foo{foo: "2"}}
	_, ok = nonPointerKeyMap[k2]
	fmt.Printf("key exists for %+v: %v\n", k2, ok)
}

Output:

pointerKeyMap: map[{foo:0xc000096020}:1]
key exists for {foo:0xc000096040}: false
nonPointerKeyMap: map[{foo:{foo:2}}:2]
key exists for {foo:{foo:2}}: true


This leads to duplicated work items being tracked by the controller, which then polls the duplicated items in https://github.com/kubernetes-csi/external-provisioner/blob/v3.6.2/pkg/capacity/capacity.go#L517-L520.

What you expected to happen:
No duplicated workqueue entries for CSIStorageCapacity objects.

How to reproduce it:

Anything else we need to know?:

Environment:

  • Driver version: v3.6.2
  • Kubernetes version (use kubectl version): v1.28.5
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@yuxiang-he
Author

cc @pohly

@yuxiang-he
Author

Also, using a literal segment value in workItem makes logs more meaningful.

The current log shows segments by their hex pointer values, which is not useful for determining which segment a given line refers to:

I0222 03:48:01.410043       1 capacity.go:574] Capacity Controller: refreshing {segment:0xc003ae26a8 storageClassName:ssd-shared}

@pohly
Contributor

pohly commented Feb 22, 2024

The problem that I had to solve here is that the content of a topology.Segment is a slice, and slices cannot be used as keys in a Go map. The solution was to use the address of a topology.Segment as key for the map and then ensure that there are never any topology.Segment instances with the same content.

Perhaps that second part is going wrong somehow?
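
For illustration, a minimal sketch of why the slice-backed segment cannot serve as a map key directly, and why the pointer workaround only behaves like value equality when pointers are kept unique (Segment and SegmentEntry here are simplified stand-ins for the types in pkg/capacity/topology):

package main

import "fmt"

// Simplified stand-ins for the types in pkg/capacity/topology.
type SegmentEntry struct {
	Key, Value string
}

// Segment is a slice type, so it is not comparable and cannot be used
// directly as a map key; a *Segment can.
type Segment []SegmentEntry

func main() {
	// capacities := map[Segment]string{} // compile error: invalid map key type Segment

	// The pointer-based workaround only acts like value equality if every
	// distinct segment content is represented by exactly one *Segment.
	a := &Segment{{Key: "topology.kubernetes.io/zone", Value: "zone-a"}}
	b := &Segment{{Key: "topology.kubernetes.io/zone", Value: "zone-a"}}

	capacities := map[*Segment]string{a: "capacity-for-a"}
	_, found := capacities[b]
	fmt.Println(found) // false: same content, different pointer
}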

Also, using a literal segment value in workItem makes logs more meaningful.

workItem should implement logr.Marshaler such that it prints the pointer value (= the unique key) and the referenced segment, and log calls should use structured, contextual logging. Then we would get all the desired information.
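
A rough, self-contained sketch of that idea, with simplified stand-ins for the real types (the exact MarshalLog output shape here is an assumption, not the project's code):

package main

import (
	"fmt"
	"strings"
)

// Simplified stand-ins for topology.Segment and the controller's workItem.
type segment []struct{ Key, Value string }

func (s *segment) SimpleString() string {
	parts := make([]string, 0, len(*s))
	for _, e := range *s {
		parts = append(parts, e.Key+"="+e.Value)
	}
	return strings.Join(parts, "+")
}

type workItem struct {
	segment          *segment
	storageClassName string
}

// MarshalLog satisfies logr.Marshaler (MarshalLog() any), so a structured log
// call such as logger.V(5).Info("refreshing", "item", item) would print both
// the unique key (the pointer) and the referenced segment content.
func (item workItem) MarshalLog() interface{} {
	return struct {
		Segment          string
		StorageClassName string
	}{
		Segment:          fmt.Sprintf("%p = %s", item.segment, item.segment.SimpleString()),
		StorageClassName: item.storageClassName,
	}
}

func main() {
	seg := &segment{{Key: "topology.kubernetes.io/zone", Value: "zone-a"}}
	fmt.Printf("%+v\n", workItem{segment: seg, storageClassName: "ssd-shared"}.MarshalLog())
}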

@yuxiang-he
Author

yuxiang-he commented Feb 22, 2024

Could it also work to use the string value of the segment in workItem? For example, there is already a function that generates a string for a Segment:

func (s *Segment) String() string {
	return fmt.Sprintf("%p = %s", s, s.SimpleString())
}

We could implement another string function that prints the SegmentEntries in strictly sorted order and leaves out the pointer value (see the sketch after this list), to ensure that:

  1. the string is consistent for different workItems that have the same segment content but a possibly different SegmentEntry order
  2. different segment pointers that point to slices with the same segment values generate the same string
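
A rough sketch of such a function, using simplified stand-ins for topology.Segment and topology.SegmentEntry (the SortedString name is hypothetical):

package main

import (
	"fmt"
	"sort"
	"strings"
)

// Simplified stand-ins for topology.SegmentEntry and topology.Segment.
type SegmentEntry struct {
	Key, Value string
}

type Segment []SegmentEntry

// SortedString is a hypothetical helper: it sorts the entries and leaves out
// the pointer value, so two segments with the same content, in any order,
// produce the same string and could serve as a stable map key.
func (s Segment) SortedString() string {
	entries := make([]string, 0, len(s))
	for _, e := range s {
		entries = append(entries, e.Key+"="+e.Value)
	}
	sort.Strings(entries)
	return strings.Join(entries, "+")
}

func main() {
	a := Segment{{Key: "zone", Value: "a"}, {Key: "host", Value: "node-1"}}
	b := Segment{{Key: "host", Value: "node-1"}, {Key: "zone", Value: "a"}} // same content, different order

	fmt.Println(a.SortedString())                     // host=node-1+zone=a
	fmt.Println(a.SortedString() == b.SortedString()) // true
}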

@yuxiang-he
Author

Perhaps that second part is going wrong somehow?

I see two places that call addWorkitem, which adds the items to the c.capacities map.

The segments are all returned from c.topologyInformer, so I don't believe they will have the same address even if they represent the same segment. And addWorkitem simply checks key existence in the following way, which does not work when workItem holds the segments by address:

_, found := c.capacities[item]
if !found {
	c.capacities[item] = nil
}

@pohly
Contributor

pohly commented Feb 22, 2024

The segments are all returned from c.topologyInformer so I don't believe they will have the same address even if they represent the same segment.

c.topologyInformer should keep segment pointers unique. If it doesn't, then that's the bug.

And addWorkitem simply checks key existence in this way which does not work when the segments exist as addresses in workItem.

Why should this not work? Two struct values are equal if all of their fields are equal. workItem contains two comparable fields, so it is comparable:

type workItem struct {
	segment          *topology.Segment
	storageClassName string
}

Could it also work by using the string value of the segment in workItem?

Yes, that would also be possible. But I expect the string handling to be more expensive than the current pointer comparison.
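
A hypothetical micro-benchmark with simplified types could put a number on that expectation; a content-derived string key has to be rebuilt before each lookup, whereas the pointer key is compared directly:

// Run with: go test -bench=.
package segmentkey

import (
	"fmt"
	"testing"
)

type segment []struct{ Key, Value string }

// BenchmarkPointerKey measures the current approach: the map key embeds a
// *segment, so a lookup is a pointer comparison.
func BenchmarkPointerKey(b *testing.B) {
	seg := &segment{{Key: "topology.kubernetes.io/zone", Value: "zone-a"}}
	m := map[*segment]bool{seg: true}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = m[seg]
	}
}

// BenchmarkStringKey measures the proposed alternative: the string key is
// (re)built from the segment content on every lookup.
func BenchmarkStringKey(b *testing.B) {
	seg := segment{{Key: "topology.kubernetes.io/zone", Value: "zone-a"}}
	m := map[string]bool{fmt.Sprintf("%v", seg): true}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = m[fmt.Sprintf("%v", seg)]
	}
}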

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 22, 2024