rlimit support #3595

Open
thockin opened this issue Jan 18, 2015 · 170 comments
Labels
area/isolation kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@thockin
Member

thockin commented Jan 18, 2015

moby/moby#4717 (comment)

Now that this is in, we should define how we want to use it.

@bgrant0607 bgrant0607 added area/isolation area/kubelet-api area/api Indicates an issue on api area. labels Jan 23, 2015
@bgrant0607
Member

/cc @vishh @rjnagal @vmarmol

@rjnagal
Contributor

rjnagal commented Jan 23, 2015

We can set a sane default for now. Do we want this to be exposed as a knob in the spec, or do we prefer a low/high toggle? The only advantage of a toggle is that we can possibly avoid too many jobs with high values landing on the same machine.

@thockin
Member Author

thockin commented Jan 23, 2015

Are there any downsides to setting high limits by default these days? I can't keep straight what bugs we have fixed internally that might not have been accepted upstream, especially regarding things like memcg accounting of kernel structs.

@vmarmol
Contributor

vmarmol commented Jan 23, 2015

+1 to toggle, putting it in the spec is overkill IMO.

@thockin
Member Author

thockin commented Jan 23, 2015

If there is a toggle for "few" vs "many", everyone will choose "many". We need to understand and document why "few" is the better choice most of the time, and think about how to restrict who uses "many".

@vishh
Contributor

vishh commented Jan 23, 2015

Kernel memory accounting seems to be disabled in our container VM image. The overall fd limit might also be a factor to consider. Given these constraints, providing a toggle option makes sense.

@rjnagal
Contributor

rjnagal commented Jan 23, 2015

One way to restrict "many" would be to take the global machine limits into account and use them in scheduling.
I don't think we have, or are planning to add, user-based capabilities.

@goltermann goltermann added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Jan 28, 2015
@davidopp davidopp added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Feb 17, 2015
@timcash

timcash commented Mar 4, 2015

For our project we would use both "few" and "many". The lower limit would be for our worker containers (stateless) and the higher limit for our storage containers (stateful).

@timothysc
Member

+1 to toggle, but what exactly do "few" and "many" mean?
Also, what are the implications for scheduling?

@bgrant0607
Member

I don't think few and many are useful categorizations. I also disagree with the stateless vs. storage distinction. Many frontends need lots of fds for sockets.

@rjnagal
Contributor

rjnagal commented Mar 6, 2015

I would assume that we would at best only do minimal checks in the scheduler, as these resources would be highly overcommitted. We can have an admission check on the node side to reject pod requests, or inform the scheduler when a node is running low - more of an out-of-resource model.

For the "large" and "few" values, we can start with the typical Linux max for the resource as "large" and the typical default as "few".
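
For reference, those values can be read straight off a node (typical numbers shown; they vary by distro):

ulimit -Sn                  # per-process soft NOFILE, commonly 1024
ulimit -Hn                  # per-process hard NOFILE, commonly 524288 under systemd
cat /proc/sys/fs/nr_open    # kernel ceiling on a single process's NOFILE
cat /proc/sys/fs/file-max   # system-wide limit on open file handles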

@bgrant0607 what kind of model did you have in mind for representing these as resources?

@bgrant0607
Member

I don't know that we need to track these values in the scheduler. They are more for DoS prevention than allocating a finite resource.

I'm skeptical that "large" and "few" are adequate, because the lack of numerical values would make it difficult for users to predict what category they should request, and the choice might not even be portable and/or stable over time. Do you think users wouldn't know how many file descriptors to request, for example? That seems like it can be computed with simple arithmetic based on the number of clients one wants to support, for instance.
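
For example, a back-of-the-envelope fd calculation for a frontend (all numbers illustrative):

clients=10000    # expected concurrent client sockets
upstreams=200    # connections to backends
listeners=4      # listening sockets
logs=10          # log files, config, misc
echo $(( (clients + upstreams + listeners + logs) * 12 / 10 ))  # +20% headroom -> 12256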

What's the downside of just exposing numerical parameters?

I agree we should choose reasonable, modest defaults.

@thockin
Member Author

thockin commented Mar 7, 2015

Small and large feels clunky, but I like the ability for site admins to define a few grades of service and then let users choose. I think it works pretty well internally - at least most users survive with the default of "small".

@bgrant0607
Member

Re. admin-defined policies, see moby/moby#11187

@vishh
Contributor

vishh commented Aug 10, 2015

@bgrant0607: Is this something that we can consider for v1.1?

@bgrant0607
Member

We can, but I'll have 0 bandwidth to think about it for the next month, probably.

cc @erictune

@dchen1107
Member

Docker's rlimit feature is process-based, not cgroup-based (of course, the upstream kernel doesn't have an rlimit cgroup yet). This means:

  • The limit is applied to the container's root process, and all child processes inherit it
  • There is no control over how many child processes are created
  • Processes started via docker exec do not inherit the same limit

Based on the above, I don't think this is a very useful feature, or at least not an easy-to-use one to specify and manage.
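
This is easy to see by comparing the limits of the container's root process with those of an exec'd process; a quick check (container name and ulimit values are just examples, and exact exec behavior varies by Docker version):

docker run -d --name demo --ulimit nofile=2048:2048 busybox sleep 1d
docker exec demo cat /proc/1/limits      # the root process shows the --ulimit value
docker exec demo cat /proc/self/limits   # an exec'd process may show different limits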

@shaylevi2

Where does this stand? Is it available through any config?

@vishh
Contributor

vishh commented Nov 11, 2015

@shaylevi2: Look at the previous comment. Docker's current implementation isn't what we need.

@lknite

lknite commented Apr 14, 2023

Deploying the ha-argocd Helm chart (which uses the ha-redis subchart, which in turn uses haproxy) on a Kubernetes cluster running Red Hat 9 results in haproxy pods that max out memory and CPU and almost instantly crash OOMKilled.
My issue filed against the Argo CD Helm chart: argoproj/argo-helm#1958
The apparently identified issue at haproxy: docker-library/haproxy#194

I'm working on setting this at the OS level, but I would have expected to be able to set a default within Kubernetes: for Kubernetes to ship a reasonable default, plus the ability to override it at the namespace level via a ResourceQuota and maybe at the pod level with resources.ulimit or something similar. Since I'm using a Helm chart that includes another Helm chart, passing parameters through to the end application might require the charts to be updated, so it seems reasonable to set this at a higher level where I have more control.

Looking for the "right way" to set ulimit with Kubernetes in my cluster running Red Hat 9. If it's something to set at the OS level, I would have expected the setting to appear in the Kubernetes installation documentation.

Based on a comment above, I was able to fix it with:

sed -i 's/LimitNOFILE=infinity/LimitNOFILE=65535/' /usr/lib/systemd/system/containerd.service   # cap containerd's NOFILE limit
systemctl daemon-reload        # pick up the edited unit file
systemctl restart containerd   # apply the new limit to newly started containers
k delete deployment <asdf>     # recreate the affected pods

  • Additional note: after a yum update I found this fix had been reset and the pods started crashing the nodes again. It was a quick fix once identified, but something to keep in mind.
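
To make the fix survive package updates, a systemd drop-in is safer than editing the unit under /usr/lib directly; a minimal sketch (the 65535 value is just an example):

mkdir -p /etc/systemd/system/containerd.service.d
cat > /etc/systemd/system/containerd.service.d/override.conf <<'EOF'
[Service]
LimitNOFILE=65535
EOF
systemctl daemon-reload && systemctl restart containerd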

@ddmunhoz

ddmunhoz commented Jun 1, 2023

C'mon guys, this is a basic Linux feature. Can we get moving on this one? Sure, I could go a number of ways to fix it with an init container that sets it, but having it officially supported is the way to go.

@evie404
Contributor

evie404 commented Jun 7, 2023

We recently had an incident related to process limits on our clusters and remembered this open issue. I might be interested in pushing this forward, assuming I can get time allocated from work.

I previously implemented the API and kubelet changes for a Pod feature (#44641), so I assume the moving pieces are similar. Would the next step be submitting a KEP? And if so, would this be more for SIG Scheduling, SIG Node, or another SIG?

There's also a new complication since the issue was first posted: Docker may support rlimits, but we'd have to make sure they're also supported by other container runtimes.

@thockin
Member Author

thockin commented Jun 7, 2023

SIG Node for sure.

Before writing a full KEP, I always recommend that people write a simple Google Doc which lays out the key parts of the KEP (problem statement, major design points, tradeoffs, alternative options). It's easier to quick-iterate that way.

I previously implemented API and kubelet change for a Pod feature

welcome back :)

@dims
Member

dims commented Jun 8, 2023

I might be interested in pushing this forward assuming I can get time allocated from work.

Yes please @evie404 thanks for stepping up again!

@kfox1111

We just hit this issue too.

@aojea
Member

aojea commented Nov 16, 2023

/cc

@PJGuedes

/cc

@schrej
Member

schrej commented Nov 24, 2023

There is a new "Subscribe" button in the "Notifications" section of the sidebar to the right of the issue description that can be used to get updates about an issue without writing a comment.
I'm sure many would appreciate it if anyone who is interested used that in the future instead of writing comments that don't really contribute to the discussion. I always find it disappointing to open a notification just to read a "+1" or "cc".

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 22, 2024
@smst329

smst329 commented Mar 6, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 6, 2024
@smst329

smst329 commented Mar 6, 2024

Still needed, and the k8s default limits are far, far below what a modern Linux host can handle. This should either be controllable, or it would be nice if the default were larger.

@palonsoro

@smst329 if you want to change the default ulimits, some container engines allow this at the container-engine level.

For example, cri-o allows setting this via the default_ulimits setting in the crio.runtime section of crio.conf.
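
A minimal sketch of that stanza, assuming the usual crio.conf.d drop-in directory is read (values illustrative):

cat > /etc/crio/crio.conf.d/10-default-ulimits.conf <<'EOF'
[crio.runtime]
default_ulimits = [
  "nofile=1024:65536",
]
EOF
systemctl restart crio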

However, while this would cover the use case of larger default rlimits, it would still be good to be able to set this properly per-container, so we don't have to grant high limits to everybody just so that the small subset of applications that require them can work.

@polarathene

so we don't have to grant high limits to everybody just so that the small subset of applications that require them can work

For most software, I think it should be OK to default to a soft limit of 1024 with a much higher hard limit. The hard limit is usually 524288 with systemd, but for a k8s deployment with services like Envoy it needs to be much higher. This is about RLIMIT_NOFILE specifically; others above have cited needing to adjust other limits.

This won't help software like Envoy, however, as they don't properly implement support for raising their soft limit internally, which they should, since they advise setting it to the max. Other software, though, can regress when the inherited soft limit is raised higher than it should be.

It would probably be good for those that need the higher limits to mention the software they need it for and how high they need that to be.


Are there any downsides to setting high limits by default these days?
I can't keep straight what bugs we have fixed internally that might not have been accepted upstream, especially regarding things like memcg accounting of kernel structs.

@thockin There are downsides to setting high soft limits, which can introduce regressions; changing the default would just break software for a different group of users.

It needs to be configurable, although you may be able to strike a compromise (e.g. for RLIMIT_NOFILE) that minimizes regressions while raising the limit for those who need more; some software may expect a soft limit of 1024 and malfunction (see this reference for examples).

AFAIK the memcg accounting concerns were addressed some time ago, so that shouldn't be a problem anymore.

@kfox1111

kfox1111 commented Mar 6, 2024

Yeah, we hit a problem with limits being raised too high by default in the runtime, and workloads started oom'ing after a k8s upgrade as they had memory limits set and needed significantly more memory allocated to deal with all the extra fd's it decided to freak out about. A similar issue is hit by processes that start and then try to close all fd's. They can take significantly longer to startup if granted too many. But some software really does need lots of fd's. So it does need to be configurable for certain workloads and set low for the rest.

@polarathene

workloads started oom'ing after a k8s upgrade as they had memory limits set and needed significantly more memory allocated to deal with all the extra fd's it decided to freak out about.

That is one of the known regressions that can happen with some software, like Java IIRC, where the range of FDs is pre-allocated in an array during init, which, when excessive, will use many GB of memory. MySQL was also known for this, but I have heard it has since been fixed.

It should only occur if the soft limit is set too high. Some software (like anything built with Go 1.20, IIRC) will implicitly raise the soft limit to the hard limit if it deems it safe to do so; originally this was problematic since it didn't restore the soft limit for child processes, which did not have the same safety, but that has since been fixed.


A similar issue is hit by processes that start and then try to close all fd's.

That one is also due to an excessive soft limit. The software is usually a daemon service following the best practice of closing FDs during init, and on non-container environments you wouldn't have a soft limit mistakenly set to excessive levels (infinity, aka 2^30).

Some have adapted with improved approaches to this step that aren't affected, but those are more sensitive to the environment, as they either have a more specific expectation of the platform (iterating /proc/self/fd) or use newer syscalls that raise the minimum supported kernel (which is also platform-specific).
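
For illustration, a rough shell sketch of the /proc/self/fd variant (assumes procfs is mounted; keeps stdin/stdout/stderr open):

for fd in /proc/self/fd/*; do
  fd=${fd##*/}                            # strip the /proc/self/fd/ prefix
  [ "$fd" -gt 2 ] && eval "exec $fd>&-"   # close everything above stderr
done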


They can take significantly longer to startup if granted too many.

Yes, it wastes CPU, and I've seen it slow processes down by 10 minutes or so. Worse is when it affects other software operations like package managers; dnf is one IIRC, and building a PowerDNS Docker image was reported to take many hours, whereas it is much quicker when the soft limit is what it should be by default (1024).

But some software really does need lots of fd's. So it does need to be configurable for certain workloads and set low for the rest.

They need to document that (some, like Envoy, don't), and ideally they should handle it internally by raising the soft limit for the process(es) that need it, either to the hard limit or to a value configurable in that software (as nginx and others support, so that the raised soft limit is restricted in scope to avoid issues).
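
Until software does that itself, one workaround is a wrapper entrypoint that raises the soft limit only for the container that needs it; a minimal sketch, assuming a shell with ulimit -S/-H support (the envoy path is a placeholder):

#!/bin/sh
# Raise the soft NOFILE limit to the hard limit for this process tree only,
# then replace the shell with the real application.
ulimit -Sn "$(ulimit -Hn)"
exec /usr/local/bin/envoy "$@"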

@kfox1111

kfox1111 commented Mar 7, 2024

Being able to set low soft and hard limits can prevent abuse by the pods that have no need for more (most pods), so per-pod configurability would still be good.

@oboote

oboote commented Mar 8, 2024

Also running into this issue of really high ulimits on the new AWS EKS Amazon Linux 2023 AMIs.

I've been tearing my hair out all day trying to set limits for rabbitmq and mysql pods. mysql won't start and rabbitmq is consuming 15x the usual amount of memory.

I've tried using:

  • a sidecar container to set ulimits, but that doesn't seem to carry over
  • commands in the manifest to set ulimit and then call entrypoint.sh
  • a postStart lifecycle hook to set ulimits
  • mysql docker-entrypoint-initdb.d init scripts to set ulimit

I've even set LimitNOFILE=65565 in the containerd unit, with daemon-reload and a containerd restart in user-data, to apply it globally on the worker nodes...

Yet if I exec into pods on those nodes, I still get a 2^30 ulimit...

I'm completely at a loss - going to revert back to AL2 AMIs until I can think of something else...
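
For anyone debugging the same thing, the check I've been using to see what a container actually got (pod name is a placeholder):

kubectl exec <pod> -- cat /proc/1/limits              # limits on the main container process
# note: an exec'd shell is a new process and may carry different limits than
# PID 1 (see the docker exec discussion above), so /proc/1/limits is the
# more reliable check
kubectl exec <pod> -- sh -c 'ulimit -Sn; ulimit -Hn'  # soft/hard NOFILE as an exec'd shell sees them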

@mattbrandman

@oboote I had the exact same AL2023 issue and have no idea what to do. I hit it with the Redis HAProxy chart.

@mateusz-kolecki

Having the same issue here with Varnish 6.1 on Debian 11 images. Since closefrom() is not available on Debian 11, Varnish spends 10 minutes closing all available fds in a loop. Future versions are smart enough to check what is actually open by inspecting /proc/${pid}/fd/, but I'm stuck with 6.1 and trying to figure out some solution. Having control over ulimit per container would help a lot.
