rlimit support #3595

Open
thockin opened this issue Jan 18, 2015 · 170 comments
Labels
area/isolation kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@thockin
Member

thockin commented Jan 18, 2015

moby/moby#4717 (comment)

Now that this is in, we should define how we want to use it.

@bgrant0607 bgrant0607 added area/isolation area/kubelet-api area/api Indicates an issue on api area. labels Jan 23, 2015
@bgrant0607
Member

/cc @vishh @rjnagal @vmarmol

@rjnagal
Contributor

rjnagal commented Jan 23, 2015

We can set a sane default for now. Do we want this to be exposed as a knob in the spec, or do we prefer a low/high toggle? The only advantage of a toggle is that we can possibly avoid too many jobs with high values landing on the same machine.

@thockin
Member Author

thockin commented Jan 23, 2015

Are there any downsides to setting high limits by default these days? I can't keep straight what bugs we have fixed internally that might not have been accepted upstream, especially regarding things like memcg accounting of kernel structs.

@vmarmol
Contributor

vmarmol commented Jan 23, 2015

+1 to toggle, putting it in the spec is overkill IMO.

@thockin
Member Author

thockin commented Jan 23, 2015

If there is a toggle for "few" vs "many", everyone will choose "many". We need to understand and document why "few" is the better choice most of the time, and think about how to restrict who uses "many".

@vishh
Contributor

vishh commented Jan 23, 2015

Kernel memory accounting seems to be disabled in our container VM image. The overall fd limit might also be a factor to consider. Given these constraints, providing a toggle option makes sense.

@rjnagal
Contributor

rjnagal commented Jan 23, 2015

One way to restrict "many" would be to take the global machine limits into account and use them in scheduling.
I don't think we have, or are planning to add, user-based capabilities.

@goltermann goltermann added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Jan 28, 2015
@davidopp davidopp added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Feb 17, 2015
@timcash

timcash commented Mar 4, 2015

For our project we would use both "few" and "many". The lower limit would be for our worker containers (stateless) and the higher limit for our storage containers (stateful).

@timothysc
Member

+1 to toggle, but what exactly do "few" and "many" mean?
Also, what are the implications for scheduling?

@bgrant0607
Member

I don't think few and many are useful categorizations. I also disagree with the stateless vs. storage distinction. Many frontends need lots of fds for sockets.

@rjnagal
Contributor

rjnagal commented Mar 6, 2015

I would assume that we would at best only do minimal checks in the scheduler, as these resources would be highly overcommitted. We can have an admission check on the node side to reject pod requests, or inform the scheduler when a node is running low - more of an out-of-resource model.

For the "large" and "few" values, we can start with the typical Linux max for the resource as "large" and the typical default as "few".
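
For reference, those values can be read straight off a node (typical numbers shown; they vary by distro):

ulimit -Sn                  # per-process soft NOFILE, commonly 1024
ulimit -Hn                  # per-process hard NOFILE, commonly 524288 under systemd
cat /proc/sys/fs/nr_open    # kernel ceiling on a single process's NOFILE
cat /proc/sys/fs/file-max   # system-wide limit on open file handles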

@bgrant0607 what kind of model did you have in mind for representing these as resources?

@bgrant0607
Member

I don't know that we need to track these values in the scheduler. They are more for DoS prevention than allocating a finite resource.

I'm skeptical that "large" and "few" are adequate, because the lack of numerical values would make it difficult for users to predict what category they should request, and the choice might not even be portable and/or stable over time. Do you think users wouldn't know how many file descriptors to request, for example? That seems like it can be computed with simple arithmetic based on the number of clients one wants to support, for instance.
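
For example, a back-of-the-envelope fd calculation for a frontend (all numbers illustrative):

clients=10000    # expected concurrent client sockets
upstreams=200    # connections to backends
listeners=4      # listening sockets
logs=10          # log files, config, misc
echo $(( (clients + upstreams + listeners + logs) * 12 / 10 ))  # +20% headroom -> 12256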

What's the downside of just exposing numerical parameters?

I agree we should choose reasonable, modest defaults.

@thockin
Member Author

thockin commented Mar 7, 2015

Small and large feels clunky, but I like the ability for site admins to define a few grades of service and then let users choose. I think it works pretty well internally - at least most users survive with the default of "small".

@bgrant0607
Member

Re. admin-defined policies, see moby/moby#11187

@vishh
Contributor

vishh commented Aug 10, 2015

@bgrant0607: Is this something that we can consider for v1.1?

@bgrant0607
Member

We can, but I'll have 0 bandwidth to think about it for the next month, probably.

cc @erictune

@dchen1107
Member

Docker's rlimit feature is process-based, not cgroup-based (of course, the upstream kernel doesn't have an rlimit cgroup yet). This means:

  • The limit is applied to the container's root process, and all child processes inherit it
  • There is no control over how many child processes are created
  • Processes started via docker exec do not inherit the same limit

Based on the above, I don't think this is a very useful feature, or at least not an easy-to-use one to specify and manage.
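
This is easy to see by comparing the limits of the container's root process with those of an exec'd process; a quick check (container name and ulimit values are just examples, and exact exec behavior varies by Docker version):

docker run -d --name demo --ulimit nofile=2048:2048 busybox sleep 1d
docker exec demo cat /proc/1/limits      # the root process shows the --ulimit value
docker exec demo cat /proc/self/limits   # an exec'd process may show different limits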

@shaylevi2

Where does this stand? Is it available through any config?

@vishh
Contributor

vishh commented Nov 11, 2015

@shaylevi2: Look at the previous comment. Docker's current implementation isn't what we need.

@lknite

lknite commented Apr 14, 2023

Deploying the ha-argocd Helm chart (which uses the ha-redis subchart, which in turn uses haproxy) on a Kubernetes cluster running Red Hat 9 results in haproxy pods that max out memory and CPU and almost instantly crash OOMKilled.
My issue filed against the Argo CD Helm chart: argoproj/argo-helm#1958
The apparently identified issue at haproxy: docker-library/haproxy#194

I'm working on setting this at the OS level, but I would have expected to be able to set a default within Kubernetes: for Kubernetes to ship a reasonable default, plus the ability to override it at the namespace level via a ResourceQuota and maybe at the pod level with resources.ulimit or something similar. Since I'm using a Helm chart that includes another Helm chart, passing parameters through to the end application might require the charts to be updated, so it seems reasonable to set this at a higher level where I have more control.

Looking for the "right way" to set ulimit with Kubernetes in my cluster running Red Hat 9. If it's something to set at the OS level, I would have expected the setting to appear in the Kubernetes installation documentation.

Based on a comment above, I was able to fix it with:

sed -i 's/LimitNOFILE=infinity/LimitNOFILE=65535/' /usr/lib/systemd/system/containerd.service   # cap containerd's NOFILE limit
systemctl daemon-reload        # pick up the edited unit file
systemctl restart containerd   # apply the new limit to newly started containers
k delete deployment <asdf>     # recreate the affected pods

  • Additional note: after a yum update I found this fix had been reset and the pods started crashing the nodes again. It was a quick fix once identified, but something to keep in mind.
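
To make the fix survive package updates, a systemd drop-in is safer than editing the unit under /usr/lib directly; a minimal sketch (the 65535 value is just an example):

mkdir -p /etc/systemd/system/containerd.service.d
cat > /etc/systemd/system/containerd.service.d/override.conf <<'EOF'
[Service]
LimitNOFILE=65535
EOF
systemctl daemon-reload && systemctl restart containerd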

@ddmunhoz

ddmunhoz commented Jun 1, 2023

C'mon guys, this is a basic Linux feature. Can we get moving on this one? Sure, I could go a number of ways to fix it with an init container that sets it, but having it officially supported is the way to go.

@evie404
Contributor

evie404 commented Jun 7, 2023

We recently had an incident related to process limits on our clusters and remembered this open issue. I might be interested in pushing this forward, assuming I can get time allocated from work.

I previously implemented the API and kubelet changes for a Pod feature (#44641), so I assume the moving pieces are similar. Would the next step be submitting a KEP? And if so, would this be more for SIG Scheduling, SIG Node, or another SIG?

There's also a new complication since the issue was first posted: Docker may support rlimits, but we'd have to make sure they're also supported by other container runtimes.

@thockin
Member Author

thockin commented Jun 7, 2023

SIG Node for sure.

Before writing a full KEP, I always recommend that people write a simple Google Doc which lays out the key parts of the KEP (problem statement, major design points, tradeoffs, alternative options). It's easier to quick-iterate that way.

I previously implemented API and kubelet change for a Pod feature

welcome back :)

@dims
Member

dims commented Jun 8, 2023

I might be interested in pushing this forward assuming I can get time allocated from work.

Yes please @evie404 thanks for stepping up again!

@kfox1111

We just hit this issue too.

@aojea
Member

aojea commented Nov 16, 2023

/cc

@PJGuedes

/cc

@schrej
Member

schrej commented Nov 24, 2023

There is a new "Subscribe" button in the "Notifications" section of the sidebar to the right of the issue description that can be used to get updates about an issue without writing a comment.
I'm sure many would appreciate it if anyone who is interested used that in the future instead of writing comments that don't really contribute to the discussion. I always find it disappointing to open a notification just to read a "+1" or "cc".

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 22, 2024
@smst329

smst329 commented Mar 6, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 6, 2024
@smst329

smst329 commented Mar 6, 2024

Still needed, and the k8s default limits are far, far below what a modern Linux host can handle. This should either be controllable, or it would be nice if the default were larger.

@palonsoro

@smst329 if you want to change the default ulimits, some container engines allow this at the container-engine level.

For example, cri-o allows setting this via the default_ulimits setting in the crio.runtime section of crio.conf.
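
A minimal sketch of that stanza, assuming the usual crio.conf.d drop-in directory is read (values illustrative):

cat > /etc/crio/crio.conf.d/10-default-ulimits.conf <<'EOF'
[crio.runtime]
default_ulimits = [
  "nofile=1024:65536",
]
EOF
systemctl restart crio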

However, while this would cover the use case of larger default rlimits, it would still be good to be able to set this properly per-container, so we don't have to grant high limits to everybody just so that the small subset of applications that require them can work.

@polarathene

so we don't have to grant high limits to everybody just so that the small subset of applications that require them can work

For most software, I think it should be OK to default to a soft limit of 1024 with a much higher hard limit. The hard limit is usually 524288 with systemd, but for a k8s deployment with services like Envoy it needs to be much higher. This is about RLIMIT_NOFILE specifically; others above have cited needing to adjust other limits.

This won't help software like Envoy, however, as they don't properly implement support for raising their soft limit internally, which they should, since they advise setting it to the max. Other software, though, can regress when the inherited soft limit is raised higher than it should be.

It would probably be good for those that need the higher limits to mention the software they need it for and how high they need that to be.


Are there any downsides to setting high limits by default these days?
I can't keep straight what bugs we have fixed internally that might not have been accepted upstream, especially regarding things like memcg accounting of kernel structs.

@thockin There are downsides to setting high soft limits, which can introduce regressions; changing the default would just break software for a different group of users.

It needs to be configurable, although you may be able to strike a compromise (e.g. for RLIMIT_NOFILE) that minimizes regressions while raising the limit for those who need more; some software may expect a soft limit of 1024 and malfunction (see this reference for examples).

AFAIK the memcg accounting concerns were addressed some time ago, so that shouldn't be a problem anymore.

@kfox1111

kfox1111 commented Mar 6, 2024

Yeah, we hit a problem with limits being raised too high by default in the runtime, and workloads started oom'ing after a k8s upgrade as they had memory limits set and needed significantly more memory allocated to deal with all the extra fd's it decided to freak out about. A similar issue is hit by processes that start and then try to close all fd's. They can take significantly longer to startup if granted too many. But some software really does need lots of fd's. So it does need to be configurable for certain workloads and set low for the rest.

@polarathene

workloads started oom'ing after a k8s upgrade as they had memory limits set and needed significantly more memory allocated to deal with all the extra fd's it decided to freak out about.

That is one of the known regressions that can happen with some software, like Java IIRC, where the range of FDs is pre-allocated in an array during init, which, when excessive, will use many GB of memory. MySQL was also known for this, but I have heard it has since been fixed.

It should only occur if the soft limit is set too high. Some software (like anything built with Go 1.20, IIRC) will implicitly raise the soft limit to the hard limit if it deems it safe to do so; originally this was problematic since it didn't restore the soft limit for child processes, which did not have the same safety, but that has since been fixed.


A similar issue is hit by processes that start and then try to close all fd's.

That one is also due to an excessive soft limit. The software is usually a daemon service following the best practice of closing FDs during init, and on non-container environments you wouldn't have a soft limit mistakenly set to excessive levels (infinity, aka 2^30).

Some have adapted with improved approaches to this step that aren't affected, but those are more sensitive to the environment, as they either have a more specific expectation of the platform (iterating /proc/self/fd) or use newer syscalls that raise the minimum supported kernel (which is also platform-specific).
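
For illustration, a rough shell sketch of the /proc/self/fd variant (assumes procfs is mounted; keeps stdin/stdout/stderr open):

for fd in /proc/self/fd/*; do
  fd=${fd##*/}                            # strip the /proc/self/fd/ prefix
  [ "$fd" -gt 2 ] && eval "exec $fd>&-"   # close everything above stderr
done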


They can take significantly longer to startup if granted too many.

Yes, it wastes CPU, and I've seen it slow processes down by 10 minutes or so. Worse is when it affects other software operations like package managers; dnf is one IIRC, and building a PowerDNS Docker image was reported to take many hours, whereas it is much quicker when the soft limit is what it should be by default (1024).

But some software really does need lots of fd's. So it does need to be configurable for certain workloads and set low for the rest.

They need to document that (some, like Envoy, don't), and ideally they should handle it internally by raising the soft limit for the process(es) that need it, either to the hard limit or to a value configurable in that software (as nginx and others support, so that the raised soft limit is restricted in scope to avoid issues).
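
Until software does that itself, one workaround is a wrapper entrypoint that raises the soft limit only for the container that needs it; a minimal sketch, assuming a shell with ulimit -S/-H support (the envoy path is a placeholder):

#!/bin/sh
# Raise the soft NOFILE limit to the hard limit for this process tree only,
# then replace the shell with the real application.
ulimit -Sn "$(ulimit -Hn)"
exec /usr/local/bin/envoy "$@"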

@kfox1111

kfox1111 commented Mar 7, 2024

Being able to set low soft and hard limits can prevent abuse by the pods that have no need for more (most pods), so per-pod configurability would still be good.

@oboote

oboote commented Mar 8, 2024

Also running into this issue of really high ulimits on the new AWS EKS Amazon Linux 2023 AMIs.

I've been tearing my hair out all day trying to set limits for rabbitmq and mysql pods. mysql won't start and rabbitmq is consuming 15x the usual amount of memory.

I've tried using:

  • a sidecar container to set ulimits, but that doesn't seem to carry over
  • commands in the manifest to set ulimit and then call entrypoint.sh
  • a postStart lifecycle hook to set ulimits
  • mysql docker-entrypoint-initdb.d init scripts to set ulimit

I've even set LimitNOFILE=65565 in the containerd unit, with daemon-reload and a containerd restart in user-data, to apply it globally on the worker nodes...

Yet if I exec into pods on those nodes, I still get a 2^30 ulimit...

I'm completely at a loss - going to revert back to AL2 AMIs until I can think of something else...
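
For anyone debugging the same thing, the check I've been using to see what a container actually got (pod name is a placeholder):

kubectl exec <pod> -- cat /proc/1/limits              # limits on the main container process
# note: an exec'd shell is a new process and may carry different limits than
# PID 1 (see the docker exec discussion above), so /proc/1/limits is the
# more reliable check
kubectl exec <pod> -- sh -c 'ulimit -Sn; ulimit -Hn'  # soft/hard NOFILE as an exec'd shell sees them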

@mattbrandman

@oboote I had the exact same AL2023 issue and have no idea what to do. I hit it with the Redis HAProxy chart.

@mateusz-kolecki

Having the same issue here with Varnish 6.1 on Debian 11 images. Since closefrom() is not available on Debian 11, Varnish spends 10 minutes closing all available fds in a loop. Future versions are smart enough to check what is actually open by inspecting /proc/${pid}/fd/, but I'm stuck with 6.1 and trying to figure out some solution. Having control over ulimit per container would help a lot.
