replicate cache hanging on configmaps #142

Open
mvaalexp opened this issue Apr 26, 2022 · 5 comments

mvaalexp commented Apr 26, 2022

I am using a fairly large cluster with some configmaps that are constantly updating. When I add the --replicate=v1/configmaps flag to kube-mgmt, the watch seems to hang for hours (I am seeing around 3+ hours).

The only conclusion I can come to is that, because of the constantly updating ConfigMaps, ResultChan never returns?

w, err := resource.Watch(context.TODO(), metav1.ListOptions{
    ResourceVersion: resourceVersion,
})
if err != nil {
    return errKubernetes{fmt.Errorf("watch: %w", err)}
}
defer w.Stop()
ch := w.ResultChan()
for {
    select {
    case evt := <-ch:
        switch evt.Type {
        case watch.Added:
            err := s.syncAdd(evt.Object)
            if err != nil {
                return errOPA{fmt.Errorf("add event: %w", err)}
            }
        case watch.Modified:
            err := s.syncAdd(evt.Object)
            if err != nil {
                return errOPA{fmt.Errorf("modify event: %w", err)}
            }
        case watch.Deleted:
            err := s.syncRemove(evt.Object)
            if err != nil {
                return errOPA{fmt.Errorf("delete event: %w", err)}
            }
        case watch.Error:
            return errKubernetes{fmt.Errorf("error event: %v", evt.Object)}
        default:
            return errChannelClosed{}
        }
    case <-quit:
        return nil
    }
}
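
For illustration only (not a kube-mgmt change, just the standard metav1.ListOptions field), the watch could be bounded so that the API server closes it and ResultChan is guaranteed to terminate:

// Sketch: ask the API server to end the watch after at most ~5 minutes.
// The names resource, resourceVersion and errKubernetes are from the snippet above.
timeout := int64(300)
w, err := resource.Watch(context.TODO(), metav1.ListOptions{
    ResourceVersion: resourceVersion,
    TimeoutSeconds:  &timeout,
})
if err != nil {
    return errKubernetes{fmt.Errorf("watch: %w", err)}
}
defer w.Stop()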

The only other thing I can think of is that the client-side throttling issue we have on the cluster is causing the channel to fail out.

I0426 16:36:41.173364   49052 request.go:665] Waited for 1.189022768s due to client-side throttling, not priority and fairness, request: GET:

Any help here would be appreciated, thanks.

@mvaalexp (Author)

Throttling is not the issue; I was able to reproduce this on a local kind cluster without throttling issues.
I can see there is a channel error:

time="2022-04-27T17:22:49Z" level=info msg="Syncing v1/configmaps." 
time="2022-04-27T17:22:53Z" level=info msg="Listed v1/configmaps and got 65 resources with resourceVersion 68957. Took 4.4922964s." 
time="2022-04-27T17:22:53Z" level=info msg="Loaded v1/configmaps resources into OPA. Took 304.392ms. Starting watch at resourceVersion 68957." 
time="2022-04-27T18:18:33Z" level=info msg="Sync channel for v1/configmaps closed. Restarting immediately." 
time="2022-04-27T18:18:33Z" level=info msg="Syncing v1/configmaps." 
time="2022-04-27T18:18:33Z" level=info msg="Listed v1/configmaps and got 65 resources with resourceVersion 85312. Took 17.362ms." 
time="2022-04-27T18:18:33Z" level=info msg="Loaded v1/configmaps resources into OPA. Took 62.9825ms. Starting watch at resourceVersion 85312." 
time="2022-04-27T19:03:56Z" level=info msg="Sync channel for v1/configmaps closed. Restarting immediately." 
time="2022-04-27T19:03:56Z" level=info msg="Syncing v1/configmaps." 
time="2022-04-27T19:03:56Z" level=info msg="Listed v1/configmaps and got 65 resources with resourceVersion 98147. Took 17.3734ms." 
time="2022-04-27T19:03:56Z" level=info msg="Loaded v1/configmaps resources into OPA. Took 56.3345ms. Starting watch at resourceVersion 98147." 
time="2022-04-27T19:50:35Z" level=info msg="Sync channel for v1/configmaps closed. Restarting immediately." 
time="2022-04-27T19:50:35Z" level=info msg="Syncing v1/configmaps." 
time="2022-04-27T19:50:35Z" level=info msg="Listed v1/configmaps and got 65 resources with resourceVersion 111433. Took 42.1965ms." 
time="2022-04-27T19:50:35Z" level=info msg="Loaded v1/configmaps resources into OPA. Took 182.5547ms. Starting watch at resourceVersion 111433." 
time="2022-04-27T20:31:46Z" level=info msg="Sync channel for v1/configmaps closed. Restarting immediately." 
time="2022-04-27T20:31:46Z" level=info msg="Syncing v1/configmaps." 
time="2022-04-27T20:31:46Z" level=info msg="Listed v1/configmaps and got 65 resources with resourceVersion 123096. Took 31.6638ms." 
time="2022-04-27T20:31:46Z" level=info msg="Loaded v1/configmaps resources into OPA. Took 144.7732ms. Starting watch at resourceVersion 123096."

@ra-grover

We also have the same issue; the container just hung for a few days on:
time="2022-11-03T17:23:19Z" level=info msg="Loaded 701 resources of kind v1/configmaps into OPA. Took 265.538471ms"

And we are on the latest version 7.1.1.

@tehlers320 (Contributor)

Is this using the wrong informer?

https://github.com/open-policy-agent/kube-mgmt/blob/7.1.1/pkg/data/generic.go#L126

// NewInformer returns a Store and a controller for populating the store
// while also providing event notifications. You should only used the returned
// Store for Get/List operations; Add/Modify/Deletes will cause the event
// notifications to be faulty.

It specifically says to only use the returned Store for Get/List operations.

NewSharedIndexInformer says it's specifically for a listwatcher:

// NewSharedIndexInformer creates a new instance for the listwatcher.
// The created informer will not do resyncs if the given
// defaultEventHandlerResyncPeriod is zero.

Here is a cluster with 2000 ConfigMaps, and only 27 of them ever make it through to my raw HTTP receiver.

grep 'kubernetes/configmaps' /tmp/config-replicate  |grep -v HTTP |cut -d \: -f2  |sort |uniq -c |sort -nr |head -5
 157  /v1/data/kubernetes/configmaps/test/uno
 157  /v1/data/kubernetes/configmaps/test/dos
 156  /v1/data/kubernetes/configmaps/test/tres
 155  /v1/data/kubernetes/configmaps/test/cuatro
 155  /v1/data/kubernetes/configmaps/test/cinco
grep 'kubernetes/configmaps' /tmp/config-replicate  |grep -v HTTP |cut -d \: -f2  |sort |uniq -c |wc -l
27

I think this is what the upstream authors meant by 'faulty': the cache state is changing too rapidly and resetting the list?
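
For comparison, here is a rough, self-contained sketch of what driving the replication through a shared index informer could look like; this is not a kube-mgmt patch, and the client setup and handler bodies are placeholders:

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/fields"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/cache"
)

func main() {
    // Placeholder client setup; kube-mgmt wires up its clients differently.
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    clientset := kubernetes.NewForConfigOrDie(cfg)

    // List/watch v1/configmaps across all namespaces.
    lw := cache.NewListWatchFromClient(
        clientset.CoreV1().RESTClient(), "configmaps", "", fields.Everything())

    // The shared index informer re-lists and re-watches on its own when the
    // watch is dropped, instead of the hand-rolled loop quoted earlier.
    informer := cache.NewSharedIndexInformer(lw, &corev1.ConfigMap{}, 0, cache.Indexers{})

    informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc:    func(obj interface{}) { fmt.Println("add", obj.(*corev1.ConfigMap).Name) },
        UpdateFunc: func(_, obj interface{}) { fmt.Println("update", obj.(*corev1.ConfigMap).Name) },
        DeleteFunc: func(obj interface{}) { fmt.Println("delete") }, // obj may be a DeletedFinalStateUnknown
    })

    stop := make(chan struct{})
    defer close(stop)
    go informer.Run(stop)

    if !cache.WaitForCacheSync(stop, informer.HasSynced) {
        panic("cache never synced")
    }
    select {} // block; real code would tie this to shutdown handling
}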

@tehlers320 (Contributor)

I made a PR based on the client-go comment, but I think the whole diagnosis may have been wrong. It looks like we were perhaps just attacking ourselves with leader-election ConfigMaps, which may have somehow overloaded OPA itself.

As to why we never see all ConfigMaps load... I suspect that the LIST function is never initialized, like this: https://github.com/open-policy-agent/kube-mgmt/pull/183/files#diff-10432976b832c2bf005c1a2b317bff34f985023e0e2e43650c2cfd1fbcb1f7faR151-R155
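
As an illustration of what I mean (a hypothetical sketch, not the actual PR), a cache.ListWatch with both functions wired to the dynamic resource would look roughly like this:

// Hypothetical: wire up both List and Watch so the informer can re-list,
// rather than relying on the watch alone. "resource" is the dynamic
// ResourceInterface from the snippet at the top of this issue.
lw := &cache.ListWatch{
    ListFunc: func(opts metav1.ListOptions) (runtime.Object, error) {
        return resource.List(context.TODO(), opts)
    },
    WatchFunc: func(opts metav1.ListOptions) (watch.Interface, error) {
        return resource.Watch(context.TODO(), opts)
    },
}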

I'm leaving the PR in draft because I think this may possibly be a non-issue with kube-mgmt, given the very high volume of uploads to OPA that should be eliminated first.

@tehlers320 (Contributor)

I have debug logging on my ignoreNamespaces branch, which may explain why we never see the loads complete:

time="2023-02-23T01:37:08Z" level=debug msg="queue length: 873"
time="2023-02-23T01:37:11Z" level=debug msg="queue length: 872"
time="2023-02-23T01:37:13Z" level=debug msg="queue length: 871"
time="2023-02-23T01:37:16Z" level=debug msg="queue length: 870"
time="2023-02-23T01:37:19Z" level=debug msg="queue length: 869"
time="2023-02-23T01:37:22Z" level=debug msg="queue length: 868"
time="2023-02-23T01:37:24Z" level=debug msg="queue length: 867"
time="2023-02-23T01:37:27Z" level=debug msg="queue length: 866"
time="2023-02-23T01:37:29Z" level=debug msg="queue length: 865"
time="2023-02-23T01:37:32Z" level=debug msg="queue length: 864"
time="2023-02-23T01:37:34Z" level=debug msg="queue length: 863"
time="2023-02-23T01:37:36Z" level=debug msg="queue length: 862"

This is with the loader fixed to ignore leader elections, plus different namespaces that may not have their leader elections labeled correctly. You can see here that it's taking ~2 seconds per request. So if the reloads are NOT bypassed, the queue never empties and kube-mgmt falls so far behind that it can never finish the queue.
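
For context, the kind of filtering I mean is roughly this (sketch only; the annotation key is the one client-go's resourcelock package writes for ConfigMap-based leader election locks):

// Sketch: skip ConfigMaps that only carry leader-election state before they
// are pushed to OPA, so constant lease renewals don't flood the sync queue.
// cm is a *corev1.ConfigMap (k8s.io/api/core/v1).
const leaderElectionAnnotation = "control-plane.alpha.kubernetes.io/leader"

func isLeaderElectionConfigMap(cm *corev1.ConfigMap) bool {
    _, ok := cm.GetAnnotations()[leaderElectionAnnotation]
    return ok
}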

I'm going to link this to #186, as I'm pretty certain this was of our own doing, and I will work to tidy up whatever I can to make this mergeable.
