Reaching timeout when getting KubeObject on big cluster #4616

Open
Calvinaud opened this issue May 3, 2024 · 2 comments · May be fixed by #4680
@Calvinaud
Contributor

What happened:
Context: infrastructure installed at cluster level, on a cluster with a large number of namespaces/objects.

When trying to create a ChaosExperiment in the UI, we cannot select any App Namespace after selecting the App Kind. This is due to the query timing out, just like in #4308.
In our case, the timeout we are hitting is the browser's own (1 min), since the getKubeObject query is taking too long.

The query is taking too long because our infrastructure is taking too long to answer: around 2-3 minutes to send back the KubeObject list. This is probably because we have a big cluster and the function getKubeObject is not efficient enough. (For information, this is not due to a lack of resources from requests/limits.)

What you expected to happen:
The query doesn't time out and we can select the namespace.

Where can this issue be corrected? (optional)
I think the main point to solve this issue is to make getKubeObject/GetKubernetesObjects (https://github.com/litmuschaos/litmus/blob/master/chaoscenter/subscriber/pkg/k8s/objects.go#L27) more efficient/faster.

I think multiple solutions are possible (so don't hesitate if you have a better one).
The first possible solution is to add some parallelism around https://github.com/litmuschaos/litmus/blob/master/chaoscenter/subscriber/pkg/k8s/objects.go#L65. I don't think this is the best solution, since it creates a risk of DOSing the API server with a lot of requests.
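A rough, untested sketch of what bounded parallelism could look like (the function and parameter names here are illustrative, not from the Litmus codebase): the concurrency limit keeps the number of in-flight LIST calls small instead of spawning one goroutine per namespace.

```go
// Sketch only: list one resource type across namespaces with a bounded number of
// concurrent LIST calls, so the API server is not flooded.
package k8sutil

import (
	"context"
	"sync"

	"golang.org/x/sync/errgroup"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

func listAcrossNamespaces(ctx context.Context, dyn dynamic.Interface,
	gvr schema.GroupVersionResource, namespaces []string, maxInFlight int) (map[string]*unstructured.UnstructuredList, error) {

	results := make(map[string]*unstructured.UnstructuredList, len(namespaces))
	var mu sync.Mutex

	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(maxInFlight) // cap concurrent requests (e.g. 5-10) instead of one goroutine per namespace

	for _, ns := range namespaces {
		ns := ns // capture loop variable
		g.Go(func() error {
			list, err := dyn.Resource(gvr).Namespace(ns).List(ctx, metav1.ListOptions{})
			if err != nil {
				return err
			}
			mu.Lock()
			results[ns] = list
			mu.Unlock()
			return nil
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err
	}
	return results, nil
}
```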

The second solution, which seems harder to put in place, is to not retrieve all the objects of a type in every namespace at once, but to split it into two queries (see the sketch after this list):

  • the first retrieves only the namespace list when an object type is selected;
  • then, when the user selects a namespace, retrieve the list of objects/labels only for that namespace.
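Illustrative only (the function names are not from the Litmus codebase): one cheap call for the namespace names, and a second per-namespace call once the user has picked one.

```go
// Sketch of the two-query idea: namespace names first, objects on demand.
package k8sutil

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/kubernetes"
)

// First query: only the namespace names, returned when an object type is selected.
func listNamespaceNames(ctx context.Context, clientset kubernetes.Interface) ([]string, error) {
	nsList, err := clientset.CoreV1().Namespaces().List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(nsList.Items))
	for _, ns := range nsList.Items {
		names = append(names, ns.Name)
	}
	return names, nil
}

// Second query: objects (and their labels) only for the namespace the user selected.
func listObjectsInNamespace(ctx context.Context, dyn dynamic.Interface,
	gvr schema.GroupVersionResource, namespace string) (*unstructured.UnstructuredList, error) {
	return dyn.Resource(gvr).Namespace(namespace).List(ctx, metav1.ListOptions{})
}
```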

How to reproduce it (as minimally and precisely as possible):

  • Create a cluster with a large number of namespaces/objects (a small helper sketch follows these steps).
  • Install ChaosCenter/the chaos infrastructure at cluster level.
  • Try to create a ChaosExperiment.
  • (To reproduce more easily, you can add some delay to the responses from the API server.)
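For example, a throwaway helper like this (not part of the Litmus repo; the count and the "load-test-" prefix are arbitrary) can populate a test cluster with enough namespaces to make the listing slow:

```go
// Throwaway load helper: create many namespaces so the object listing becomes
// slow enough to hit the browser-side timeout.
package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	ctx := context.Background()

	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile) // ~/.kube/config
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	for i := 0; i < 500; i++ {
		ns := &corev1.Namespace{ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("load-test-%d", i)}}
		if _, err := clientset.CoreV1().Namespaces().Create(ctx, ns, metav1.CreateOptions{}); err != nil {
			log.Printf("namespace %d: %v", i, err)
		}
	}
}
```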

Anything else we need to know?:
Two questions related to this topic:

  • Shouldn't the order of the selection fields be changed to namespace first, then object type, then object name? (I think it's not really necessary; the only gain is that you could show only the object types present in the namespace.)
  • Shouldn't a "free text" option still be available in the UI, to be able to create a ChaosExperiment that targets an object that is not deployed? (I know it's still possible to do it in the YAML, but that's not really user-friendly.)
@Calvinaud added the bug label on May 3, 2024
@smitthakkar96
Contributor

It would be nice to create a metadata-only informer to speed up lookups for labels, or to use a cached client in the subscriber. This should speed up the lookups.

https://firehydrant.com/blog/dynamic-kubernetes-informers/
https://medium.com/@timebertt/kubernetes-controllers-at-scale-clients-caches-conflicts-patches-explained-aa0f7a8b4332
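A rough sketch of what a metadata-only informer could look like with client-go (the resource, resync interval, and in-cluster config here are assumptions, not Litmus code); the cache holds only names, namespaces, and labels, which is what the UI dropdowns need.

```go
// Sketch: metadata-only informer serving label/name lookups from a local cache.
package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/metadata"
	"k8s.io/client-go/metadata/metadatainformer"
	"k8s.io/client-go/rest"
)

func main() {
	ctx := context.Background()

	cfg, err := rest.InClusterConfig() // assumes the subscriber runs in-cluster
	if err != nil {
		panic(err)
	}
	metaClient, err := metadata.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	factory := metadatainformer.NewSharedInformerFactory(metaClient, 30*time.Minute)
	gvr := schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "deployments"}
	informer := factory.ForResource(gvr)

	factory.Start(ctx.Done())
	factory.WaitForCacheSync(ctx.Done())

	// Lookups are now served from the local cache instead of hitting the API server each time.
	objs, err := informer.Lister().ByNamespace("default").List(labels.Everything())
	if err != nil {
		panic(err)
	}
	fmt.Printf("deployments (metadata only) in default: %d\n", len(objs))
}
```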

@smitthakkar96
Contributor

smitthakkar96 commented May 22, 2024

On a side note, to get better visibility into performance and stability issues, it would be great to kick off efforts on instrumentation using https://opentelemetry.io/ (logs, metrics, traces, and profiles). This would allow users to monitor Litmus components on their own o11y stack, gain insights into scaling and stability challenges, and report/contribute with more data.
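A minimal sketch of what bootstrapping OTLP tracing could look like in a Go component (the tracer and span names are placeholders; the collector endpoint comes from the standard OTEL_EXPORTER_OTLP_ENDPOINT environment variable):

```go
// Sketch: OTLP tracing setup so slow calls like the Kubernetes object listing
// show up as spans in whatever o11y backend the user runs.
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	exporter, err := otlptracegrpc.New(ctx)
	if err != nil {
		log.Fatal(err)
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	defer func() { _ = tp.Shutdown(ctx) }()
	otel.SetTracerProvider(tp)

	// Wrap the expensive call in a span so slow object listings become visible in traces.
	tracer := otel.Tracer("litmus-subscriber")
	_, span := tracer.Start(ctx, "GetKubernetesObjects")
	// ... call the Kubernetes listing here ...
	span.End()
}
```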
