
per-caller-thread limits #76

Open
orenbenkiki opened this issue Aug 3, 2020 · 9 comments

Comments

@orenbenkiki

orenbenkiki commented Aug 3, 2020

The limits set by threadpool_limits are global to the process. This makes it difficult to avoid oversubscription when invoking parallel operations (e.g. NumPy functions) from within a parallel divide-and-conquer algorithm.

Ideally, parallel multi-threading frameworks would be fully multi-threading-aware, that is, enforce a limit on the total number of threads used regardless of how many threads are generating requests. This, however, seems too much to ask for :-(

A simpler modification would be to support per-caller-thread limits. A divide-and-conquer algorithm could then, at each step, subdivide its total thread budget. As a secondary upside, a budget of an odd number of threads (2n+1) could be split into n threads for one sub-task and n+1 for the other, fully utilizing all threads, rather than setting a global limit of n threads for each (leaving one thread idle) or n+1 for each (oversubscribing).

Is such finer-grained control over thread limits possible? If so, I'd love to see support for it in threadpoolctl.
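
For illustration, here is a minimal sketch of the kind of budget splitting described above, assuming a hypothetical `per_thread_limits` context manager that would only affect the calling thread (today's `threadpool_limits` is process-global, so concurrent worker threads would overwrite each other's setting):

```python
# Sketch only: `per_thread_limits` is hypothetical and does not exist in
# threadpoolctl; the limits set by threadpool_limits apply to the whole
# process, so the commented-out line below cannot be written yet.
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def solve(data, budget):
    """Divide-and-conquer step that splits its thread budget between sub-tasks."""
    if budget <= 1 or len(data) < 4:
        # with per_thread_limits(limits=budget):  # hypothetical per-caller-thread API
        return float(np.linalg.norm(data))        # stand-in for a BLAS-backed operation
    left = budget // 2                            # e.g. a budget of 2n+1 splits into n...
    right = budget - left                         # ...and n+1, using every thread
    mid = len(data) // 2
    with ThreadPoolExecutor(max_workers=2) as pool:
        a = pool.submit(solve, data[:mid], left)
        b = pool.submit(solve, data[mid:], right)
    return a.result() + b.result()


print(solve(np.arange(16.0), budget=5))
```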

@jeremiedbb
Collaborator

As far as I know, it's not possible in a multi-threaded setting, because we can only set the maximum number of threads of the native C libraries (BLAS, OpenMP) globally. I think MKL has a way to specify a thread-local maximum number of threads, but I'm not sure it solves this issue.

It might be different if your algorithm uses multi-processing. Then you should be able to set the number of threads in each sub-process. You'd still have to keep track of the budget at each step; the scheduling part is outside the scope of threadpoolctl.
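
For reference, a rough sketch of the MKL-specific route mentioned above, assuming an MKL-backed NumPy and that the MKL runtime can be loaded by the name used below (the library name is an assumption and varies across platforms). `mkl_set_num_threads_local` limits MKL's threads for the calling thread only:

```python
# Assumption: Linux with libmkl_rt.so discoverable by the dynamic loader.
import ctypes

mkl = ctypes.CDLL("libmkl_rt.so")
previous = mkl.mkl_set_num_threads_local(2)  # this thread now uses at most 2 MKL threads
# ... run BLAS-heavy work on this thread ...
mkl.mkl_set_num_threads_local(0)             # 0 reverts this thread to the global setting
```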

@ogrisel
Contributor

ogrisel commented Oct 1, 2020

The standard OpenMP and BLAS APIs do not provide a generic way to do this, as @jeremiedbb said above. It would be great to lobby BLAS implementation developers to provide a consistent API to set the parallelism budget on a per-BLAS-call, thread-local basis. I think the BLIS developers intended to do so a while ago, but I have not followed their development recently, and as far as I know OpenBLAS does not provide anything like this.

@orenbenkiki
Author

@jeremiedbb seems correct: if one uses multi-processing, it is possible to use threadpoolctl to set a different "global" thread limit in each sub-process - at least, this seems to be working for me. That is, I use multiprocessing.Pool.map and wrap each invocation in a function that checks whether it is the first one running in the sub-process; if so, it first asks threadpoolctl for a reduced number of threads and only then does the actual work.
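
For concreteness, here is a minimal sketch of that workaround (the limit of 2 threads per worker, the pool size, and the matrix sizes are arbitrary example values):

```python
# Minimal sketch of the workaround described above: each worker process lowers
# its own thread limit once, on the first task it executes, then does the work.
import multiprocessing as mp

import numpy as np
from threadpoolctl import threadpool_limits

_limit_set = False  # module-level flag; each worker process has its own copy


def work(size):
    global _limit_set
    if not _limit_set:                        # first task running in this sub-process
        threadpool_limits(limits=2)           # "global" limit, but local to this process
        _limit_set = True
    x = np.random.rand(size, size)
    return float(np.linalg.norm(x @ x.T))     # BLAS-backed call, now capped at 2 threads


if __name__ == "__main__":
    with mp.Pool(processes=4) as pool:
        print(pool.map(work, [200, 300, 400, 500]))
```

A multiprocessing.Pool `initializer=` callback that calls threadpool_limits once per worker would avoid the per-task flag, but the allocation of threads to sub-processes would remain static.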

@ogrisel
Contributor

ogrisel commented Jan 31, 2022

Based on the last reply, I have the feeling that we can close this issue.

@ogrisel ogrisel closed this as completed Jan 31, 2022
@orenbenkiki
Author

Note that the workaround has significant disadvantages:

  • It only works using multiprocessing and not multithreading
  • It requires trickery to set the number of threads per process only once (in the first invocation of a task on the process)
  • It is inefficient: because threads are statically allocated to sub-processes, long-running tasks in a few sub-processes are limited to a few CPUs even when most CPUs are idle (the other sub-processes having completed execution)

@ogrisel
Contributor

ogrisel commented Jan 31, 2022

What you describe is far beyond the scope of threadpoolctl. I think what you want is close to what TBB offers with a full-fledged task scheduler. However, that would require all the threaded tools of the ecosystem (BLAS, machine learning libraries, signal processing libraries...) to use TBB instead of OpenMP... and currently in the Python world, for instance, Cython does not have syntactic support for interfacing with a TBB runtime (as far as I know).

Also note that TBB has its limitations w.r.t. over-subscription in practical deployment scenarios like Docker containers, see: oneapi-src/oneTBB#190. They might be fixable though.

@ogrisel
Contributor

ogrisel commented Jan 31, 2022

I can reopen the issue with a more descriptive title, but it's unlikely to ever be solved because the major BLAS implementations (e.g. OpenBLAS) do not offer such control (maybe BLIS does?) and this is not part of the OpenMP standard either (as far as I know).

@ogrisel ogrisel reopened this Jan 31, 2022
@ogrisel ogrisel changed the title Parallel Divide and Conquer per-caller-thread limits Jan 31, 2022
@orenbenkiki
Author

It is indeed a tough problem that might not be solvable in general, and OpenMP/BLAS etc. don't make it easy.

It is much easier if "everyone" agrees on a single scheduler technology (e.g. TBB). This is where Julia has an advantage: being new, and having had a multi-threaded scheduler within the language from an early stage, most packages just use it, so there is automatic balancing across multi-threaded code.

That said, if we don't keep it as an open issue, things will never get any better...

@ogrisel
Contributor

ogrisel commented Jan 31, 2022

The thing is that this problem will not be solved in threadpoolctl itself. So it's better to open such issues on the issue trackers of the open source BLAS/LAPACK implementations (starting with OpenBLAS and BLIS) and maybe of the OpenMP runtime implementations, although I am not sure their maintainers will be interested in maintaining a feature that is not part of the OpenMP specification.
