rlimit support #3595
Comments
We can set a sane default for now. Do we want this to be exposed as a knob?
Are there any downsides to setting high limits by default these days?
+1 to toggle; putting it in the spec is overkill IMO.
If there is a toggle for "few" vs. "many", everyone will choose "many".
Kernel memory accounting seems to be disabled in our container VM image.
One way to restrict "many" would be to take into account the global machine limits.
For our project we would use both "few" and "many". The lower limit would be for our worker containers (stateless) and the higher limit would be for our storage containers (stateful).
+1 to toggle, but what exactly do "few" and "many" mean?
I don't think "few" and "many" are useful categorizations. I also disagree with the stateless vs. storage distinction. Many frontends need lots of fds for sockets.
I would assume that we would at best do only minimal checks in the scheduler, as these resources would be highly overcommitted. We can have an admission check on the node side to reject pod requests, or inform the scheduler when it's running low (more of an out-of-resource model). For the "large" and "few" values, we can start with the typical Linux max for the resource as "large" and the typical default as "few". @bgrant0607 what kind of model did you have in mind for representing these as resources?
I don't know that we need to track these values in the scheduler. They are more for DoS prevention than allocating a finite resource. I'm skeptical that "large" and "few" are adequate, because the lack of numerical values would make it difficult for users to predict what category they should request, and the choice might not even be portable and/or stable over time. Do you think users wouldn't know how many file descriptors to request, for example? That seems like it can be computed with simple arithmetic based on the number of clients one wants to support, for instance. What's the downside of just exposing numerical parameters? I agree we should choose reasonable, modest defaults.
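To make the "numerical parameters" idea concrete, here is a purely hypothetical sketch of what such fields could look like as Go API types. None of these names exist in the Kubernetes API; they are illustrative assumptions only.

```go
// Illustrative only: a hypothetical shape for numerical per-container
// rlimit parameters. Nothing like this exists in the actual Kubernetes API.
package api

// RlimitSpec pairs a named rlimit with explicit numeric values.
type RlimitSpec struct {
	// Type names the resource, e.g. "nofile" (open file descriptors),
	// "nproc" (processes/threads), or "memlock" (locked memory).
	Type string
	// Soft is the limit enforced when the container starts.
	Soft uint64
	// Hard is the ceiling; a process may raise its own soft limit up to it.
	Hard uint64
}
```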
"small" and "large" feels clunky, but I like the ability for site admins to define policies.
Re. admin-defined policies, see moby/moby#11187
@bgrant0607: Is this something that we can consider for v1.1?
We can, but I'll have 0 bandwidth to think about it for the next month, probably. cc @erictune
Docker's rlimit feature is process-based, not cgroup-based (of course, the upstream kernel doesn't have an rlimit cgroup yet). This means the limits apply to individual processes rather than to the container or pod as a whole. Based on the above, I don't think this is a very useful feature, or at least not an easy-to-use feature to specify and manage.
Where does this stand? Is it available through any config?
@shaylevi2: Look at the previous comment. Docker's current implementation isn't what we need.
Deploying the ha-argocd Helm chart (which uses the ha-redis subchart, which in turn uses haproxy) on a Kubernetes cluster running RedHat 9, I see the haproxy pods max out memory and CPU and almost instantly crash OOMKilled. I am working on setting this at the OS level, but I would have expected to be able to set a default within Kubernetes: for Kubernetes to have already set a reasonable default, with the ability to override at the namespace level via ResourceQuota, and maybe at the pod level with resources.ulimit or something similar. Since I'm using a Helm chart which includes another Helm chart, passing parameters down to the end application might require the charts to be updated, so it seems reasonable to set this at a higher level where I have more control. I'm looking for the "right way" to set ulimit with Kubernetes in my cluster running on RedHat 9; if it's something to set at the OS level, I would have expected the setting to appear in the Kubernetes installation documentation. Based on a comment above, I was able to fix it.
C'mon guys, this is a basic Linux feature. Can we get around to moving this one? For sure I could fix it a number of ways, e.g. with an init container that sets it, but having it officially supported is the way to go.
We recently had an incident related to process limits on our clusters and remembered this open issue. I might be interested in pushing this forward, assuming I can get time allocated from work. I previously implemented the API and kubelet changes for a Pod feature (#44641), so I assume the moving pieces are similar. Would the next step be submitting a KEP? And if so, would this be more for SIG Scheduling, SIG Node, or another SIG? There's also a new complication since the issue was first posted: Docker may support rlimits, but we'd have to make sure it's also supported by other container runtimes.
SIG Node for sure. Before writing a full KEP, I always recommend that people write a simple Google Doc which lays out the key parts of the KEP (problem statement, major design points, tradeoffs, alternative options). It's easier to quick-iterate that way.
welcome back :)
Yes please @evie404, thanks for stepping up again!
We just hit this issue too.
/cc |
\cc
There is a new "Subscribe" button in the "Notifications" section of the sidebar to the right of the issue description that can be used to get updates about an issue without writing a comment.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Still needed; the k8s default limits are far below what a modern Linux host can handle. This should either be controllable, or it would be nice if the default were larger.
@smst329 if you want to change the default ulimits, some container engines allow this at the container-engine level. For example, cri-o allows setting this via the `default_ulimits` option in crio.conf (e.g. `default_ulimits = [ "nofile=1024:65536" ]`). However, while this would cover the use case of larger default rlimits, it would still be good to be able to properly set this per-container, so we don't have to grant high limits to everybody only so that a small subset of applications that require them can work.
For most software I think it should be OK to default to a soft limit of 1024 with a much higher hard limit; usually the hard limit is already much higher (systemd defaults to 524288, for example). This won't help software like Envoy, however, as it doesn't properly implement support to raise its soft limit internally, which it should, since the project advises setting the limit to the max. Other software can regress when the inherited soft limit is raised higher than it should be. It would probably be good for those that need the higher limits to mention the software they need it for and how high they need it to be.
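For reference, "raising the soft limit internally" amounts to a single getrlimit(2)/setrlimit(2) pair at startup. A minimal Linux sketch in Go, using the standard syscall package (daemons in other languages do the equivalent with the C calls):

```go
package main

import (
	"fmt"
	"syscall"
)

func main() {
	// Sketch of a daemon raising its own soft RLIMIT_NOFILE to the hard
	// limit at startup, instead of relying on a high inherited soft limit.
	// No extra privilege is needed to move the soft limit up to the hard
	// limit; only raising the hard limit requires CAP_SYS_RESOURCE.
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
	rl.Cur = rl.Max // soft = hard
	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
	fmt.Printf("RLIMIT_NOFILE soft limit now %d\n", rl.Cur)
}
```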
@thockin There are downsides to setting high soft limits, which can introduce regressions, so changing that would just break software for a different group of users. It needs to be configurable, although you may be able to strike a compromise on the defaults. AFAIK the memcg accounting concerns were addressed some time ago, so that shouldn't pose any problems.
Yeah, we hit a problem with limits being raised too high by default in the runtime: workloads started OOMing after a k8s upgrade because they had memory limits set and needed significantly more memory allocated to deal with all the extra fds they decided to freak out about. A similar issue is hit by processes that start and then try to close all fds: they can take significantly longer to start up if granted too many. But some software really does need lots of fds. So it does need to be configurable for certain workloads and set low for the rest.
That is one of the known regressions that can happen with some software, like Java IIRC, where the range of FDs is pre-allocated in an array during init, which when excessive will use many GB of memory. MySQL was also known for this, but I've heard it has since been fixed. It should only occur if the soft limit is set too high; some software (like anything built with Go 1.19 or later) will implicitly raise the soft limit to the hard limit if it deems it safe to do so. Originally that was problematic, since the soft limit wasn't restored for child processes that did not have the same safety, but that has since been fixed.
That one is also due to an excessive soft limit. The software is usually a daemon service that is following the best practice of closing FDs during init, and in non-container environments you wouldn't have a soft limit mistakenly set to excessive levels. Some have adapted with improved approaches to this step that aren't affected, but those are more sensitive to the environment, as they have a more specific expectation of the platform (iterating /proc/self/fd, for example).
Yes, it wastes CPU and I've seen it slow down processes by 10 minutes or so. Worse is when it affects other software operations, like package managers.
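For concreteness, a sketch of the two cleanup styles being contrasted (assuming Linux; the function names are just for illustration):

```go
package main

import (
	"os"
	"strconv"
	"syscall"
)

// closeNaive is the classic daemon cleanup: try to close every possible
// descriptor up to the soft limit. With a soft limit of 2^30 that is about
// a billion close() calls, nearly all failing with EBADF, which is where
// the multi-minute startup stalls described above come from.
func closeNaive() {
	var rl syscall.Rlimit
	syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl)
	for fd := 3; fd < int(rl.Cur); fd++ {
		syscall.Close(fd) // one syscall per candidate fd
	}
}

// closeViaProc only touches descriptors that actually exist, by listing
// /proc/self/fd (the platform-specific alternative mentioned above).
func closeViaProc() {
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		return
	}
	for _, e := range entries {
		// One entry is the fd ReadDir itself used; closing it again after
		// ReadDir has closed it just returns EBADF, harmless in this sketch.
		if fd, err := strconv.Atoi(e.Name()); err == nil && fd > 2 {
			syscall.Close(fd)
		}
	}
}

func main() {
	closeViaProc()
	_ = closeNaive // shown for contrast; not something to run with a huge limit
}
```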
They need to document that (some, like Envoy, don't), and ideally they should handle it internally, raising the soft limit for the process(es) that need it, either to the hard limit or to a value configurable in that software (as nginx and others support), so that the raised soft limit is restricted in scope to avoid issues.
Being able to set a low soft and hard limit can limit its abuse for those pods that don't have any need for it (most pods), so making it configurable per pod would still be good.
Also running into this issue of really high ulimits on the new AWS EKS Amazon Linux 2023 AMIs. Been tearing my hair out all day trying to set limits for rabbitmq and mysql pods: mysql won't start and rabbitmq is consuming 15x the usual amount of memory. I've even set LimitNOFILE=65565 in the containerd unit config, with a daemon-reload and containerd restart in user-data to apply it globally on the worker nodes, yet if I exec into pods on those nodes I still get a 2^30 ulimit. I'm completely at a loss; going to revert back to the AL2 AMIs until I can think of something else.
@oboote I had the exact same AL2023 issue and have no idea what to do. I hit it with the Redis HAProxy chart.
Have the same issue here with Varnish 6.1 on Debian 11 images.
moby/moby#4717 (comment)
Now that this is in, we should define how we want to use it.