replicate cache hanging on configmaps #142

Open
mvaalexp opened this issue Apr 26, 2022 · 5 comments

mvaalexp commented Apr 26, 2022

I am using a fairly large cluster with some configmaps that are constantly updating. When I add the --replicate=v1/configmaps flag to kube-mgmt, the watch seems to hang for hours (I am seeing around 3+ hours).

The only conclusion I can come to is that, because of the constantly updating ConfigMaps, ResultChan never returns?

w, err := resource.Watch(context.TODO(), metav1.ListOptions{
    ResourceVersion: resourceVersion,
})
if err != nil {
    return errKubernetes{fmt.Errorf("watch: %w", err)}
}
defer w.Stop()
ch := w.ResultChan()
for {
    select {
    case evt := <-ch:
        switch evt.Type {
        case watch.Added:
            err := s.syncAdd(evt.Object)
            if err != nil {
                return errOPA{fmt.Errorf("add event: %w", err)}
            }
        case watch.Modified:
            err := s.syncAdd(evt.Object)
            if err != nil {
                return errOPA{fmt.Errorf("modify event: %w", err)}
            }
        case watch.Deleted:
            err := s.syncRemove(evt.Object)
            if err != nil {
                return errOPA{fmt.Errorf("delete event: %w", err)}
            }
        case watch.Error:
            return errKubernetes{fmt.Errorf("error event: %v", evt.Object)}
        default:
            return errChannelClosed{}
        }
    case <-quit:
        return nil
    }
}
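
For illustration only (not a kube-mgmt change, just the standard metav1.ListOptions field), the watch could be bounded so that the API server closes it and ResultChan is guaranteed to terminate:

// Sketch: ask the API server to end the watch after at most ~5 minutes.
// The names resource, resourceVersion and errKubernetes are from the snippet above.
timeout := int64(300)
w, err := resource.Watch(context.TODO(), metav1.ListOptions{
    ResourceVersion: resourceVersion,
    TimeoutSeconds:  &timeout,
})
if err != nil {
    return errKubernetes{fmt.Errorf("watch: %w", err)}
}
defer w.Stop()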

The only other thing I can think of is that the client-side throttling issue we have on the cluster is causing the channel to fail out.

I0426 16:36:41.173364   49052 request.go:665] Waited for 1.189022768s due to client-side throttling, not priority and fairness, request: GET:

Any help here would be appreciated, thanks.

@mvaalexp (Author)

Throttling is not the issue; I was able to reproduce this on a local kind cluster without throttling issues.
I can see there is a channel error:

time="2022-04-27T17:22:49Z" level=info msg="Syncing v1/configmaps." 
time="2022-04-27T17:22:53Z" level=info msg="Listed v1/configmaps and got 65 resources with resourceVersion 68957. Took 4.4922964s." 
time="2022-04-27T17:22:53Z" level=info msg="Loaded v1/configmaps resources into OPA. Took 304.392ms. Starting watch at resourceVersion 68957." 
time="2022-04-27T18:18:33Z" level=info msg="Sync channel for v1/configmaps closed. Restarting immediately." 
time="2022-04-27T18:18:33Z" level=info msg="Syncing v1/configmaps." 
time="2022-04-27T18:18:33Z" level=info msg="Listed v1/configmaps and got 65 resources with resourceVersion 85312. Took 17.362ms." 
time="2022-04-27T18:18:33Z" level=info msg="Loaded v1/configmaps resources into OPA. Took 62.9825ms. Starting watch at resourceVersion 85312." 
time="2022-04-27T19:03:56Z" level=info msg="Sync channel for v1/configmaps closed. Restarting immediately." 
time="2022-04-27T19:03:56Z" level=info msg="Syncing v1/configmaps." 
time="2022-04-27T19:03:56Z" level=info msg="Listed v1/configmaps and got 65 resources with resourceVersion 98147. Took 17.3734ms." 
time="2022-04-27T19:03:56Z" level=info msg="Loaded v1/configmaps resources into OPA. Took 56.3345ms. Starting watch at resourceVersion 98147." 
time="2022-04-27T19:50:35Z" level=info msg="Sync channel for v1/configmaps closed. Restarting immediately." 
time="2022-04-27T19:50:35Z" level=info msg="Syncing v1/configmaps." 
time="2022-04-27T19:50:35Z" level=info msg="Listed v1/configmaps and got 65 resources with resourceVersion 111433. Took 42.1965ms." 
time="2022-04-27T19:50:35Z" level=info msg="Loaded v1/configmaps resources into OPA. Took 182.5547ms. Starting watch at resourceVersion 111433." 
time="2022-04-27T20:31:46Z" level=info msg="Sync channel for v1/configmaps closed. Restarting immediately." 
time="2022-04-27T20:31:46Z" level=info msg="Syncing v1/configmaps." 
time="2022-04-27T20:31:46Z" level=info msg="Listed v1/configmaps and got 65 resources with resourceVersion 123096. Took 31.6638ms." 
time="2022-04-27T20:31:46Z" level=info msg="Loaded v1/configmaps resources into OPA. Took 144.7732ms. Starting watch at resourceVersion 123096."

@ra-grover

We also have the same issue; the container just hung for a few days on:
time="2022-11-03T17:23:19Z" level=info msg="Loaded 701 resources of kind v1/configmaps into OPA. Took 265.538471ms"

And we are on the latest version 7.1.1.

@tehlers320 (Contributor)

Is this using the wrong informer?

https://github.com/open-policy-agent/kube-mgmt/blob/7.1.1/pkg/data/generic.go#L126

// NewInformer returns a Store and a controller for populating the store
// while also providing event notifications. You should only used the returned
// Store for Get/List operations; Add/Modify/Deletes will cause the event
// notifications to be faulty.

It specifically says to only use the returned Store for Get/List operations.

NewSharedIndexInformer says it's specifically for a listwatcher:

// NewSharedIndexInformer creates a new instance for the listwatcher.
// The created informer will not do resyncs if the given
// defaultEventHandlerResyncPeriod is zero.

Here is a cluster with 2000 ConfigMaps, and only 27 of them ever make it through to my raw HTTP receiver.

grep 'kubernetes/configmaps' /tmp/config-replicate  |grep -v HTTP |cut -d \: -f2  |sort |uniq -c |sort -nr |head -5
 157  /v1/data/kubernetes/configmaps/test/uno
 157  /v1/data/kubernetes/configmaps/test/dos
 156  /v1/data/kubernetes/configmaps/test/tres
 155  /v1/data/kubernetes/configmaps/test/cuatro
 155  /v1/data/kubernetes/configmaps/test/cinco
grep 'kubernetes/configmaps' /tmp/config-replicate  |grep -v HTTP |cut -d \: -f2  |sort |uniq -c |wc -l
27

I think this is what the upstream authors meant by 'faulty': the cache state is changing too rapidly and resetting the list?
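
For comparison, here is a rough, self-contained sketch of what driving the replication through a shared index informer could look like; this is not a kube-mgmt patch, and the client setup and handler bodies are placeholders:

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/fields"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/cache"
)

func main() {
    // Placeholder client setup; kube-mgmt wires up its clients differently.
    cfg, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    clientset := kubernetes.NewForConfigOrDie(cfg)

    // List/watch v1/configmaps across all namespaces.
    lw := cache.NewListWatchFromClient(
        clientset.CoreV1().RESTClient(), "configmaps", "", fields.Everything())

    // The shared index informer re-lists and re-watches on its own when the
    // watch is dropped, instead of the hand-rolled loop quoted earlier.
    informer := cache.NewSharedIndexInformer(lw, &corev1.ConfigMap{}, 0, cache.Indexers{})

    informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
        AddFunc:    func(obj interface{}) { fmt.Println("add", obj.(*corev1.ConfigMap).Name) },
        UpdateFunc: func(_, obj interface{}) { fmt.Println("update", obj.(*corev1.ConfigMap).Name) },
        DeleteFunc: func(obj interface{}) { fmt.Println("delete") }, // obj may be a DeletedFinalStateUnknown
    })

    stop := make(chan struct{})
    defer close(stop)
    go informer.Run(stop)

    if !cache.WaitForCacheSync(stop, informer.HasSynced) {
        panic("cache never synced")
    }
    select {} // block; real code would tie this to shutdown handling
}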

@tehlers320 (Contributor)

I made a PR based on the client-go comment, but I think the whole diagnosis may have been wrong. It looks like we were perhaps just attacking ourselves with leader-election ConfigMaps, which may have somehow overloaded OPA itself.

As to why we never see all ConfigMaps load... I suspect that the LIST function is never initialized, like this: https://github.com/open-policy-agent/kube-mgmt/pull/183/files#diff-10432976b832c2bf005c1a2b317bff34f985023e0e2e43650c2cfd1fbcb1f7faR151-R155
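
As an illustration of what I mean (a hypothetical sketch, not the actual PR), a cache.ListWatch with both functions wired to the dynamic resource would look roughly like this:

// Hypothetical: wire up both List and Watch so the informer can re-list,
// rather than relying on the watch alone. "resource" is the dynamic
// ResourceInterface from the snippet at the top of this issue.
lw := &cache.ListWatch{
    ListFunc: func(opts metav1.ListOptions) (runtime.Object, error) {
        return resource.List(context.TODO(), opts)
    },
    WatchFunc: func(opts metav1.ListOptions) (watch.Interface, error) {
        return resource.Watch(context.TODO(), opts)
    },
}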

I'm leaving the PR in draft because I think this may possibly be a non-issue with kube-mgmt, given the very high volume of uploads to OPA that should be eliminated first.

@tehlers320 (Contributor)

I have debug logging on my ignoreNamespaces branch, which may explain why we never see the loads complete:

time="2023-02-23T01:37:08Z" level=debug msg="queue length: 873"
time="2023-02-23T01:37:11Z" level=debug msg="queue length: 872"
time="2023-02-23T01:37:13Z" level=debug msg="queue length: 871"
time="2023-02-23T01:37:16Z" level=debug msg="queue length: 870"
time="2023-02-23T01:37:19Z" level=debug msg="queue length: 869"
time="2023-02-23T01:37:22Z" level=debug msg="queue length: 868"
time="2023-02-23T01:37:24Z" level=debug msg="queue length: 867"
time="2023-02-23T01:37:27Z" level=debug msg="queue length: 866"
time="2023-02-23T01:37:29Z" level=debug msg="queue length: 865"
time="2023-02-23T01:37:32Z" level=debug msg="queue length: 864"
time="2023-02-23T01:37:34Z" level=debug msg="queue length: 863"
time="2023-02-23T01:37:36Z" level=debug msg="queue length: 862"

This is with the loader fixed to ignore leader elections, plus different namespaces that may not have their leader elections labeled correctly. You can see here that it's taking ~2 seconds per request. So if the reloads are NOT bypassed, the queue never empties and kube-mgmt falls so far behind that it can never finish the queue.
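
For context, the kind of filtering I mean is roughly this (sketch only; the annotation key is the one client-go's resourcelock package writes for ConfigMap-based leader election locks):

// Sketch: skip ConfigMaps that only carry leader-election state before they
// are pushed to OPA, so constant lease renewals don't flood the sync queue.
// cm is a *corev1.ConfigMap (k8s.io/api/core/v1).
const leaderElectionAnnotation = "control-plane.alpha.kubernetes.io/leader"

func isLeaderElectionConfigMap(cm *corev1.ConfigMap) bool {
    _, ok := cm.GetAnnotations()[leaderElectionAnnotation]
    return ok
}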

I'm going to link this to #186, as I'm pretty certain this was of our own doing, and I will work to tidy up whatever I can to make this mergeable.
