Limit number of workers to prevent system OOM #580

Open
lathiat opened this issue Jan 17, 2023 · 4 comments
Labels
Feature New feature, not a bug

Comments

@lathiat

lathiat commented Jan 17, 2023

It seems there is no limit to the number of concurrent workers for LXCFS requests.

When lxcfs requests are slow for some reason (deadlocked, slowed by high load, or something else) and many such requests keep arriving, lxcfs can consume thousands of threads and tens to hundreds of GB of memory, crashing the entire system. I saw this while working on #471 and #579.

I suggest we add a limit, even a fairly high one, to prevent this from happening. Hitting the limit should be logged at a non-debug level.
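A rough sketch of what such a cap could look like, using only a C11 atomic counter. Nothing here is lxcfs code: `MAX_WORKERS`, `worker_acquire()` and `worker_release()` are hypothetical names, and returning `-EAGAIN` when the cap is hit is just one possible policy (blocking until a slot frees up would be another).

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical cap, kept tiny for illustration; a real limit would be much higher. */
#define MAX_WORKERS 4

/* Number of requests currently being handled. */
static atomic_int active_workers;

/*
 * Try to claim a worker slot without blocking.
 * Returns 0 on success, or -EAGAIN (after logging) when the cap is reached.
 */
static int worker_acquire(void)
{
    int prev = atomic_fetch_add(&active_workers, 1);

    if (prev >= MAX_WORKERS) {
        /* Over the cap: undo our increment and refuse the request. */
        atomic_fetch_sub(&active_workers, 1);
        fprintf(stderr, "worker limit (%d) reached, rejecting request\n",
                MAX_WORKERS);
        return -EAGAIN;
    }
    return 0;
}

/* Give the slot back once the request has been handled. */
static void worker_release(void)
{
    atomic_fetch_sub(&active_workers, 1);
}
```

A request handler would call `worker_acquire()` on entry, bail out with the error if it fails, and call `worker_release()` on every exit path. The counter can briefly overshoot while a rejected caller undoes its increment, but no more than `MAX_WORKERS` callers are ever admitted.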

@nkshirsagar

Perhaps there should also be a timeout for worker threads in lxcfs, after which lxcfs returns EIO to the application making the FUSE call. Would that prevent the libfuse+kernel deadlock even if we do end up with lots of stuck lxcfs threads?
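To make the timeout idea concrete, here is a minimal sketch in plain pthreads: the handler runs in a helper thread and the caller waits on a condition variable with a deadline, returning `-EIO` if the deadline passes. This is not how lxcfs or libfuse dispatch requests; `run_with_timeout()`, `struct job` and the two demo handlers are hypothetical names made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

typedef int (*handler_fn)(void *arg);

struct job {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    handler_fn fn;
    void *arg;
    int result;
    int done;      /* handler has finished */
    int abandoned; /* waiter timed out; the helper thread frees the job */
};

static void job_free(struct job *j)
{
    pthread_cond_destroy(&j->cond);
    pthread_mutex_destroy(&j->lock);
    free(j);
}

static void *job_thread(void *p)
{
    struct job *j = p;
    int r = j->fn(j->arg);

    pthread_mutex_lock(&j->lock);
    j->result = r;
    j->done = 1;
    int abandoned = j->abandoned;
    pthread_cond_signal(&j->cond);
    pthread_mutex_unlock(&j->lock);

    if (abandoned)
        job_free(j); /* nobody is waiting any more */
    return NULL;
}

/* Run fn(arg) in a helper thread; return its result, or -EIO on timeout. */
static int run_with_timeout(handler_fn fn, void *arg, int timeout_ms)
{
    struct job *j = calloc(1, sizeof(*j));
    if (!j)
        return -ENOMEM;
    pthread_mutex_init(&j->lock, NULL);
    pthread_cond_init(&j->cond, NULL);
    j->fn = fn;
    j->arg = arg;

    /* Absolute deadline; condition variables use CLOCK_REALTIME by default. */
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_t tid;
    if (pthread_create(&tid, NULL, job_thread, j) != 0) {
        job_free(j);
        return -EAGAIN;
    }
    pthread_detach(tid);

    pthread_mutex_lock(&j->lock);
    int rc = 0;
    while (!j->done && rc == 0)
        rc = pthread_cond_timedwait(&j->cond, &j->lock, &deadline);
    int done = j->done;
    int result = j->result;
    if (!done)
        j->abandoned = 1; /* hand ownership of the job to the helper */
    pthread_mutex_unlock(&j->lock);

    if (!done)
        return -EIO;
    job_free(j);
    return result;
}

/* Demo handlers: one returns immediately, one stalls for two seconds. */
static int quick_handler(void *arg) { (void)arg; return 7; }
static int slow_handler(void *arg) { (void)arg; sleep(2); return 0; }
```

With these, `run_with_timeout(quick_handler, NULL, 1000)` returns the handler's result, while `run_with_timeout(slow_handler, NULL, 100)` gives up and returns `-EIO`. Note that the timeout only unblocks the caller: a handler that is truly deadlocked keeps its thread (and whatever `arg` points to) alive, so this bounds the caller's wait rather than the number of stuck threads.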

@lathiat
Author

lathiat commented Jan 17, 2023

It seems I was mistaken: the OOM was primarily caused by consuming applications behaving badly when their reads were stuck. So while a limit would probably still be ideal, I was wrong in thinking that most of the RAM was consumed by lxcfs.

A timeout may be sensible; however, depending on where the deadlock is, it may not be possible to act on it.

@nkshirsagar

> It seems I was mistaken: the OOM was primarily caused by consuming applications behaving badly when their reads were stuck. So while a limit would probably still be ideal, I was wrong in thinking that most of the RAM was consumed by lxcfs.
>
> A timeout may be sensible; however, depending on where the deadlock is, it may not be possible to act on it.

@mihalicyn can lxcfs time out if the worker thread does not return within a specified time, and return EIO or similar to the caller?

@mihalicyn
Member

mihalicyn commented Jan 17, 2023

@lathiat @nkshirsagar yep, that's a good idea. I'll think about that, of course.

Upd: libfuse versions >= 3.12.0 have a max_threads parameter: libfuse/libfuse@af5710e

The snap environment is based on Ubuntu Focal, so there we have libfuse 3.9.0.
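For what it's worth, a check like "is this libfuse new enough for max_threads" has to compare version components numerically, since a plain string comparison would rank "3.9.0" above "3.12.0". A small sketch of such a gate follows; `has_max_threads()` and the other helpers are hypothetical and not part of libfuse or lxcfs.

```c
#include <assert.h>
#include <stdio.h>

struct version {
    int major, minor, patch;
};

/* Parse "MAJOR.MINOR.PATCH"; returns 0 on success, -1 on malformed input. */
static int parse_version(const char *s, struct version *v)
{
    if (sscanf(s, "%d.%d.%d", &v->major, &v->minor, &v->patch) != 3)
        return -1;
    return 0;
}

/* Negative, zero, or positive when a is older than, equal to, or newer than b. */
static int version_cmp(const struct version *a, const struct version *b)
{
    if (a->major != b->major)
        return a->major - b->major;
    if (a->minor != b->minor)
        return a->minor - b->minor;
    return a->patch - b->patch;
}

/* 1 if the given libfuse version is >= 3.12.0, else 0 (also for bad input). */
static int has_max_threads(const char *libfuse_version)
{
    const struct version need = { 3, 12, 0 };
    struct version have;

    if (parse_version(libfuse_version, &have) != 0)
        return 0;
    return version_cmp(&have, &need) >= 0;
}
```

Under this sketch the Focal version "3.9.0" fails the check and "3.12.0" passes. As far as I know, on new enough libfuse the limit is set on the loop configuration (via `fuse_loop_cfg_set_max_threads()`) before entering the multi-threaded loop, but treat that as an assumption and check the headers of the version actually in use.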

@stgraber stgraber added the Feature New feature, not a bug label Sep 29, 2023