Set number of threads for arrow #134

Open
ivirshup opened this issue Mar 23, 2023 · 11 comments

Comments

@ivirshup

Would setting the number of threads used by arrow be in-scope for this library?

(main docs on arrow thread pools)

arrow uses environment variables to set the number of threads used at import time, but then allows dynamically changing the number of threads via setter functions such as set_cpu_count. Notably, there are two separate thread pools: one for compute and one for IO.

Is this functionality in scope for this library? If so, it would be great to see this feature.
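
For reference, here is a minimal sketch of resizing both pools through the pyarrow bindings. It assumes pyarrow is installed and relies on pa.cpu_count / pa.set_cpu_count for the compute pool and pa.io_thread_count / pa.set_io_thread_count for the IO pool. The helper name limit_arrow_pools and its optional pa argument (which lets a stand-in object be passed in for testing) are hypothetical and not part of either library.

```python
from contextlib import contextmanager


@contextmanager
def limit_arrow_pools(cpu_threads, io_threads, pa=None):
    """Temporarily resize arrow's compute and IO thread pools."""
    if pa is None:
        import pyarrow as pa  # imported lazily; assumes pyarrow is installed

    # Remember the current sizes so they can be restored on exit.
    old_cpu, old_io = pa.cpu_count(), pa.io_thread_count()
    try:
        pa.set_cpu_count(cpu_threads)
        pa.set_io_thread_count(io_threads)
        yield
    finally:
        # Restore both pools even if the body raised.
        pa.set_cpu_count(old_cpu)
        pa.set_io_thread_count(old_io)
```

Typical use would be `with limit_arrow_pools(1, 1): ...` around the arrow calls that should run with restricted pools.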

@jeremiedbb
Collaborator

Hi @ivirshup, I think that arrow doesn't implement its own threadpool but instead relies on OpenMP for that. So I think controlling the number of OpenMP threads should work:

from threadpoolctl import ThreadpoolController
controller = ThreadpoolController()

with controller.limit(limits=1, user_api='openmp'):
    ...

@ivirshup
Author

Thanks for the response. I'm not sure what the specific implementation is, but that example doesn't seem to set the number of threads pyarrow sees. I'll demonstrate:

Using threadpoolctl after pyarrow import

import pyarrow as pa
print(pa.cpu_count())

from threadpoolctl import ThreadpoolController
controller = ThreadpoolController()

with controller.limit(limits=1, user_api="openmp"):
    print(pa.cpu_count())
16
16

Using threadpoolctl during pyarrow import

from threadpoolctl import ThreadpoolController
controller = ThreadpoolController()

with controller.limit(limits=1, user_api="openmp"):
    import pyarrow as pa
    print(pa.cpu_count())
16

Setting OMP_NUM_THREADS

import os
os.environ["OMP_NUM_THREADS"] = "1"

import pyarrow as pa
print(pa.cpu_count())
1
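
Since the third experiment suggests the variable is only read when pyarrow is imported, one way to apply it reliably is to start a fresh interpreter with the variable already set. The sketch below uses only the standard library; the helper name run_with_omp_threads is hypothetical.

```python
import os
import subprocess
import sys


def run_with_omp_threads(code, num_threads):
    """Run `code` in a fresh interpreter with OMP_NUM_THREADS preset."""
    # Copy the parent's environment and override a single variable.
    env = dict(os.environ, OMP_NUM_THREADS=str(num_threads))
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

With pyarrow installed, calling it with "import pyarrow as pa; print(pa.cpu_count())" and num_threads=1 should reproduce the last result above.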

@jeremiedbb
Collaborator

Right, I misinterpreted their doc. I looked into their source code and it appears that they implement their own threadpool, which can be configured by the OMP_NUM_THREADS env var even though that variable is usually used to control the OpenMP threadpool.

I'm not sure yet if we want to explicitly support arrow. An alternative would be to allow custom controllers as requested here #137.

@ivirshup
Author

> An alternative would be to allow custom controllers as requested here #137.

I believe I prompted that 😆

@ogrisel
Contributor

ogrisel commented Jul 11, 2023

@ivirshup #138 was merged in the master branch. Feel free to give it a shot to see if it's enough for arrow.

If filename-based dynlib matching is not enough, we could extend it to complement the filename match with a symbol name match as discussed in #138 (comment), but this is not yet implemented.

@ivirshup
Author

ivirshup commented Jul 12, 2023

Great! Thanks @ogrisel and @jeremiedbb!

I'm a little unfamiliar with linking, as I've avoided learning much C++, but have given this a shot. It seems to work, but there's something a little strange going on. Here's what I've written:

import threadpoolctl, pyarrow as pa

class ArrowThreadPoolCtlController(threadpoolctl.LibController):
    user_api = "arrow"
    internal_api = "arrow"

    filename_prefixes = ("libarrow",)

    def get_num_threads(self):
        print(f"got {pa.cpu_count()} threads")
        return pa.cpu_count()

    def set_num_threads(self, num_threads):
        print(f"set to {num_threads} threads")
        pa.set_cpu_count(num_threads)

    def get_version(self):
        print("get_version called")
        return pa.__version__

    def set_additional_attributes(self):
        pass

threadpoolctl.register(ArrowThreadPoolCtlController)

with threadpoolctl.threadpool_limits(1):
    print(pa.cpu_count())

Here's the output:

get_version called
get_version called
get_version called
get_version called
got 16 threads
got 16 threads
got 16 threads
got 16 threads
set to 1 threads
set to 1 threads
set to 1 threads
set to 1 threads
1
set to 16 threads
set to 16 threads
set to 16 threads
set to 16 threads

This is from running it just once. The number of repeated calls increases each time I register the class, so it would be nice if there were some level of uniqueness for controllers.

Maybe this has to do with the number of dynlibs that start with the prefix? This was run in a conda environment which has these dylibs:

./lib/python3.10/site-packages/pyarrow/libarrow_acero.1200.dylib
./lib/python3.10/site-packages/pyarrow/libarrow_dataset.1200.dylib
./lib/python3.10/site-packages/pyarrow/libarrow.1200.dylib
./lib/python3.10/site-packages/pyarrow/libarrow_python_flight.dylib
./lib/python3.10/site-packages/pyarrow/libparquet.1200.dylib
./lib/python3.10/site-packages/pyarrow/libarrow_python.dylib
./lib/python3.10/site-packages/pyarrow/libarrow_substrait.1200.dylib
./lib/python3.10/site-packages/pyarrow/libarrow_flight.1200.dylib

@ivirshup
Author

Ah, I think I'm starting to see. I think I'm getting a dynlib for all matching files as the expectation is that I am setting the threads directly using the dynlib CDLL object.

I'm not sure I'm going to figure out how to do that. Maybe it could be done by calling the C++ methods for setting threads. I think just using "libarrow." as a signal where I then use the python interface and hope they are referring to the same dynlibs should work for my cases.
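
To illustrate the effect of the narrower prefix, here is a small pure-Python sketch run against the dylib list above. It assumes that filename_prefixes is compared against each library's basename with a plain startswith check, and the helper name matching_libs is hypothetical. Only libraries actually loaded into the process would be considered by threadpoolctl, so the number of files on disk need not equal the number of controller instances.

```python
import os

# File list copied from the conda environment shown above.
DYLIBS = [
    "./lib/python3.10/site-packages/pyarrow/libarrow_acero.1200.dylib",
    "./lib/python3.10/site-packages/pyarrow/libarrow_dataset.1200.dylib",
    "./lib/python3.10/site-packages/pyarrow/libarrow.1200.dylib",
    "./lib/python3.10/site-packages/pyarrow/libarrow_python_flight.dylib",
    "./lib/python3.10/site-packages/pyarrow/libparquet.1200.dylib",
    "./lib/python3.10/site-packages/pyarrow/libarrow_python.dylib",
    "./lib/python3.10/site-packages/pyarrow/libarrow_substrait.1200.dylib",
    "./lib/python3.10/site-packages/pyarrow/libarrow_flight.1200.dylib",
]


def matching_libs(paths, prefixes):
    """Return the basenames that start with any of the given prefixes."""
    names = (os.path.basename(p) for p in paths)
    return [n for n in names if n.startswith(tuple(prefixes))]


broad = matching_libs(DYLIBS, ("libarrow",))    # every libarrow* file
narrow = matching_libs(DYLIBS, ("libarrow.",))  # only the core library
```

Here the broad prefix matches every file except libparquet, while "libarrow." matches only libarrow.1200.dylib.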

@jeremiedbb
Collaborator

The purpose of threadpoolctl is to make it easy to control the threadpools of native libraries that don't usually have python bindings. When python bindings for the library exist, I'd advise using them directly. For your use case I'd simply do:

from contextlib import contextmanager

import pyarrow as pa

@contextmanager
def limit_arrow(num_threads):
    old_num_threads = pa.cpu_count()
    try:
        pa.set_cpu_count(num_threads)
        yield
    finally:
        pa.set_cpu_count(old_num_threads)


with limit_arrow(1):
    ...

@jeremiedbb
Collaborator

That being said, I think it would still be interesting to support arrow directly. For instance threadpoolctl provides a way to limit all supported libraries at once. Not having to write custom context managers for all libraries is nice.

I've tried to use the symbols from the shared object but there's a catch. arrow being a c++ library, symbol names are mangled :(

nm /home/jeremie/miniforge/envs/tmp2/lib/libarrow.so.1200.1.0 | grep "GetCpuThreadPoolCapacity"
00000000005ea990 T _ZN5arrow24GetCpuThreadPoolCapacityEv

nm --demangle /home/jeremie/miniforge/envs/tmp2/lib/libarrow.so.1200.1.0 | grep "GetCpuThreadPoolCapacity"
00000000005ea990 T arrow::GetCpuThreadPoolCapacity()

There are ways to demangle the name but it's gonna require some work to implement it in a robust and cross-platform way.
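
As a rough illustration of where the mangled name comes from, the sketch below builds it for the simplest case only: a free function inside one or more namespaces, taking no arguments, under the Itanium C++ ABI used by GCC and Clang. The helper name mangle_nullary is hypothetical, and nothing here covers arguments, templates, or the different scheme used by MSVC.

```python
def mangle_nullary(*qualified_name):
    """Itanium-ABI mangled name of a namespaced free function with no arguments.

    Example input: ("arrow", "GetCpuThreadPoolCapacity").
    """
    if len(qualified_name) < 2:
        # Unscoped names use a different (non-nested) encoding.
        raise ValueError("expected at least a namespace and a function name")
    # Each component is encoded as <length><identifier>.
    parts = "".join(f"{len(part)}{part}" for part in qualified_name)
    # _Z starts a mangled name, N...E wraps a nested name, v means "no arguments".
    return f"_ZN{parts}Ev"


symbol = mangle_nullary("arrow", "GetCpuThreadPoolCapacity")
```

For this input the result matches the symbol reported by nm above.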

@ivirshup
Author

> Not having to write custom context managers for all libraries is nice.

Yeah, this is really what I like about this library!

> native libraries that don't usually have python bindings.

So my concern is about cases where calling pyarrow wouldn't work, e.g. if I were calling some other program that calls out to pyarrow.compute. If I either don't have pyarrow in this environment, or this program is using a bundled/separate version of arrow, the pyarrow approach doesn't work.

Maybe arrow devs would have interest in supporting this?

@jeremiedbb
Collaborator

jeremiedbb commented Jul 13, 2023

ping @jorisvandenbossche, we'd like to have your opinion on that :)

We're interested in adding support for arrow in threadpoolctl but I'm facing some issues. The way threadpoolctl works is by searching for and loading the shared library, then trying to call the symbols responsible for controlling the number of threads. In arrow these symbols seem to be GetCpuThreadPoolCapacity and SetCpuThreadPoolCapacity.

The issue is that since arrow is a c++ library, the names of the symbols are mangled, see #134 (comment), making it hard to retrieve for threadpoolctl. I can see 3 alternatives:

  • arrow exports these symbols as C functions: extern "C" int arrow::GetCpuThreadPoolCapacity(), but arrow people might not want to do that 😄, and even then it does not completely guarantee that the name of the symbol won't change at all (it could acquire a leading underscore, for instance).

  • threadpoolctl implements a mechanism to try to demangle the name by looking at the list of all symbols in the dso and trying to match the mangled names with the one we're looking for. It will be very tricky to make it work consistently on all platforms.

  • the latest version of threadpoolctl allows third party developers to implement and register a custom controller for their library. You can see an attempt at writing such a controller for arrow, through pyarrow, here: #134 (comment). Here the controller does not rely on the c++ symbols but on their python bindings instead. We can't do that in threadpoolctl because we don't want to have a dependency on pyarrow.
    Do you think the devs of pyarrow would be willing to implement and register an official controller for arrow? An issue with that, mentioned here: #134 (comment), is that if another lib bundles arrow, it will need to register its own controller.
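
To make the second alternative a bit more concrete, here is a rough sketch of matching a mangled name in a textual symbol listing such as the nm output quoted earlier. It assumes the Itanium convention of encoding each name component as its length followed by the identifier; the helper name find_mangled_symbol is hypothetical. A substring match like this can also hit overloads or unrelated symbols that embed the same sequence, which is part of why a robust cross-platform version is hard.

```python
def find_mangled_symbol(nm_output, *qualified_name):
    """Return the first symbol in `nm_output` that embeds the qualified name."""
    # ("arrow", "GetCpuThreadPoolCapacity") -> "5arrow24GetCpuThreadPoolCapacity"
    needle = "".join(f"{len(part)}{part}" for part in qualified_name)
    for line in nm_output.splitlines():
        fields = line.split()
        # The symbol name is the last whitespace-separated field of each line.
        if fields and needle in fields[-1]:
            return fields[-1]
    return None


sample = "00000000005ea990 T _ZN5arrow24GetCpuThreadPoolCapacityEv"
found = find_mangled_symbol(sample, "arrow", "GetCpuThreadPoolCapacity")
```

The returned string could then be looked up on the loaded library object, assuming the platform exposes the symbol under exactly that name.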
