Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[collector] Unable to remove CPU limits when used as a subchart #644

Closed
verejoel opened this issue Feb 15, 2023 · 15 comments · Fixed by #745
Closed

[collector] Unable to remove CPU limits when used as a subchart #644

verejoel opened this issue Feb 15, 2023 · 15 comments · Fixed by #745
Labels
chart:collector Issue related to opentelemetry-collector helm chart

Comments

@verejoel
Copy link

Using this chart forces one to set a CPU limit due to it not being possible to remove the default keys.

As an example, say I want to have 1 core of cpu requests, but not set a limit. If my values are

# values.yaml
  resources:
    requests:
      cpu: 1000m
      memory: 1Gi
    limits:
      cpu: null
      memory: 4Gi

this yields the generated manifest:

# manifest.yaml
          resources:
            limits:
              cpu: 256m
              memory: 4Gi
            requests:
              cpu: 1000m
              memory: 1Gi

i.e., the CPU limit is populated by the default value.

This is a known bug in helm: helm/helm#5184

Can we please comment out the default resources in the value file and leave them as an example?

@TylerHelmuth TylerHelmuth added the chart:collector Issue related to opentelemetry-collector helm chart label Feb 15, 2023
@TylerHelmuth
Copy link
Member

@verejoel what version of helm are you using? When install

mode: daemonset
resources:
    requests:
      cpu: 1000m
      memory: 1Gi
    limits:
      cpu: null
      memory: 4Gi

I get

    Limits:
      memory:  4Gi
    Requests:
      cpu:      1
      memory:   1Gi

Are you by any chance using the collector chart as a subchart?

@verejoel
Copy link
Author

verejoel commented Feb 27, 2023

Yes it's deployed as a sub-chart (I updated the title to reflect this), the chart manifest is:

apiVersion: v2
name: collector
version: 1.0.0
description: A Helm chart for deploying Open Telemetry collector as a deployment
type: application
dependencies:
  - name: opentelemetry-collector
    alias: collector
    repository: https://open-telemetry.github.io/opentelemetry-helm-charts
    version: 0.43.5

Is there a strong reason why the chart should explicitly set resource requests/limits? These are almost guaranteed to be different for every particular use-case and require some configuration from the user. The common practice seems to be to set the resources key to an empty map in the chart definition, and let the user decide.

@verejoel verejoel changed the title Unable to remove CPU limits Unable to remove CPU limits when using as a subchart Feb 27, 2023
@verejoel verejoel changed the title Unable to remove CPU limits when using as a subchart [collector] Unable to remove CPU limits when used as a subchart Feb 27, 2023
@TylerHelmuth
Copy link
Member

I believe the original intent was to help protect users from the collector running wild. The default values for this chart are intentionally made to be easily installed into a cluster to see quick value.

Looking through other charts (Prometheus, Jaeger, and Grafana), I do agree that what this chart does is not common; most charts use an empty resources object like you mentioned.

Since this an empty resource object is commonplace and since a fix for the "null doesn't work with subchart" bug isn't done yet I would support updating this chart to follow suit. The default values.yaml will still be installable and usable and we can use comments in the values.yaml to suggest resource values.

@open-telemetry/helm-approvers do you agree?

@dmitryax
Copy link
Member

@verejoel I'm curious what's the goal of removing limits?

@dmitryax
Copy link
Member

We used to configure the memory_limiter based on the limits initially, so that was important to have them in the first place. Now, since the collector can start without providing resources, I'm good with removing them. But I would like to understand the use case first, not just do it because others do.

@verejoel
Copy link
Author

Hi @dmitryax and @TylerHelmuth , thanks for your replies.

The goal is to be able to remove CPU limits when using the collector chart as a sub-chart, but still be able to set memory limits. Unfortunately I am limited to using the chart as a sub-chart by my deployment framework, and I cannot unset the limits any other way in this case.

In our cluster, our nodes have rather under-utilized overall CPU usage, so it is nice not to have to set CPU limits explicitly and let each collector pod use as many cycles as each node has to spare.

@dmitryax
Copy link
Member

In our cluster, our nodes have rather under-utilized overall CPU usage, so it is nice not to have to set CPU limits explicitly and let each collector pod use as many cycles as each node has to spare.

It's a dangerous and discouraged practice. If some custom collector component introduces a memory leak, it can bring down user nodes. IMO this use case doesn't justify having this kind of a change. I'm curious what other https://github.com/orgs/open-telemetry/teams/helm-approvers think

@verejoel
Copy link
Author

Why does memory come into the argument? We are talking about CPU.

@dmitryax
Copy link
Member

dmitryax commented Feb 27, 2023

The suggestion is to remove everything from the resources by default. And similar concerns applied to CPU, even not that critical.

@verejoel
Copy link
Author

Understood. However I believe those kind of concerns are beyond the remit of the Helm Chart maintainers. A helm chart should be general enough that it can be customized to any use-case. We have identified a situation where that is not possible, and the fix is to bring the chart more in-line with other charts by removing the default, arbitrarily chosen limits and entrusting this decision fully to the user.

@TylerHelmuth
Copy link
Member

With the switch to percentages by default for memory_limiter and memory_ballast I don't believe we have a strict requirement on resources being present anymore. Yes it is a best practice to set those values when running the collector in k8s, but technically not a requirement.

This use case aside (and yes I think it is dangerous to run without explicit CPU Limits and I don't encourage that), it does seem like a valid request to be able to be able to manage resources however you please, even in a subchart. I suspect that the sub-chart null bug is the reason many other charts set resources: {} instead of setting default values like we do.

If null worked in subcharts I would vote not to change the defaults. Maybe helm/helm#11440 will get merged in and released making this discussion moot?

@povilasv
Copy link
Contributor

povilasv commented Mar 2, 2023

A couple of thoughts from my side:

Upside of no resources:

Downsides:

  • Personally, I think have some requests and limits is better than none, if a Pod is crashing due too OOM, it's better than Node OOM
  • Sort of against best k8s practices

So in total, I think I am in favour of setting resources to empty object

@dmitryax
Copy link
Member

dmitryax commented Mar 3, 2023

Ok. If most of us think removing the resources is better, let's do it. But we should throw a warning if limits are not set in https://github.com/open-telemetry/opentelemetry-helm-charts/blob/main/charts/opentelemetry-collector/templates/NOTES.txt . @TylerHelmuth @povilasv, WDTY?

@TylerHelmuth
Copy link
Member

I think that's a great idea.

@povilasv
Copy link
Contributor

povilasv commented Mar 3, 2023

Agree :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment