
Optionally release GIL when calling run on simulations #1463

Open
ianrgraham opened this issue Jan 13, 2023 · 26 comments
Labels
enhancement New feature or request

Comments

@ianrgraham
Contributor

Description

There are a number of research applications where it would be great to be able to run multiple simulation objects concurrently and safely share data among them. A few that I can think of: Gibbs Ensemble method, nudged elastic band minimization, and parallel tempering.

One obstacle to performing these methods efficiently is the GIL. Fortunately, the vast majority of code in HOOMD does not directly talk to the Python C API, so in theory the GIL can be released.

That said, this certainly opens up the chance of abuse and bugs galore when C++ objects are mistakenly shared across threads. I ran into this in my personal testing by creating a single device = hoomd.device.GPU(), instantiating simulations with it, and then sending those simulations to a bunch of threads. This led to random segfaults in the internal CachedAllocator when trying to allocate a temporary buffer. This was easily fixed by attaching a distinct device object to each simulation, but this example highlights the possibility of misuse that this feature would allow.

There would certainly be some thorny bits in allowing the GIL to be released, but I think the freedom it gives to plugin authors greatly outweighs the chances of misuse. For example, I was able to quickly implement an efficient nudged elastic band plugin by sharing the particle data arrays (on host and device) across threads. Each simulation object works independently until its HalfStepHook is reached, waits until all threads reach a barrier, shares positional information (without copying!) to modify the net force, waits at the barrier once more, and then continues independently.
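
For concreteness, a rough sketch of that coordination pattern is below. The hook class, its update() signature, and the data layout are simplified stand-ins for illustration, not HOOMD's actual HalfStepHook API.

#include <barrier>
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-replica data shared (by pointer, no copies) across simulation threads.
struct Replica
{
    std::vector<double> positions;
    std::vector<double> forces;
};

class BandHook
{
public:
    BandHook(std::barrier<>& sync, std::vector<Replica*>& replicas, std::size_t rank)
        : m_sync(sync), m_replicas(replicas), m_rank(rank)
    {
    }

    // Called from each simulation's own thread once per step.
    void update(std::uint64_t /*timestep*/)
    {
        m_sync.arrive_and_wait(); // every replica has finished its half step

        // Read a neighboring replica's positions directly and add a simple
        // spring contribution to this replica's forces (placeholder physics).
        Replica& self = *m_replicas[m_rank];
        if (m_rank + 1 < m_replicas.size())
        {
            const Replica& next = *m_replicas[m_rank + 1];
            for (std::size_t i = 0; i < self.forces.size(); ++i)
                self.forces[i] += 0.1 * (next.positions[i] - self.positions[i]);
        }

        m_sync.arrive_and_wait(); // no replica resumes until all forces are set
    }

private:
    std::barrier<>& m_sync;
    std::vector<Replica*>& m_replicas;
    std::size_t m_rank;
};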

Proposed solution

The implementation of the release itself is simple. Knowing whether there are any parts of the C++ code that require the GIL (like writing to stdout without introducing formatting errors) is much harder, and I'm not 100% familiar with the entire code base.

I think the unsafety this feature introduces can be mitigated by keeping it experimental and/or warning developers that undefined behavior is easy to trigger if they are not careful.

Some undefined behavior can be mitigated by (easiest) ensuring in Python that objects are attached to at most one simulation, or (hardest) rewriting these implementations to be thread-safe. For example, the current implementation of hoomd.Device can be shared by multiple simulations, but it is not internally thread-safe. Internal C++ classes like the CachedAllocator were written to be called by one thread at a time.

One other detail is that for useful information to be shared across simulation threads (like positions, velocities, etc.), the access logic for GlobalArray should be updated to (at a minimum) allow multiple threads to borrow immutably if not already borrowed mutably by the owning thread. This access could also be made atomic, but that's not strictly necessary. Other concurrency mechanisms (like barriers) can be used by plugin authors to ensure safety is maintained.
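
As a minimal illustration of that borrow rule (shared reads allowed unless an exclusive mutable borrow is outstanding), and not GlobalArray's actual access machinery, the bookkeeping could look roughly like this:

#include <atomic>
#include <stdexcept>

// Hypothetical borrow tracker: a positive value counts outstanding shared
// (read-only) borrows; -1 marks an exclusive (mutable) borrow.
class BorrowState
{
public:
    void acquire_shared()
    {
        int cur = m_state.load();
        do
        {
            if (cur < 0)
                throw std::runtime_error("already borrowed mutably");
        } while (!m_state.compare_exchange_weak(cur, cur + 1));
    }

    void release_shared() { m_state.fetch_sub(1); }

    void acquire_unique()
    {
        int expected = 0;
        if (!m_state.compare_exchange_strong(expected, -1))
            throw std::runtime_error("already borrowed");
    }

    void release_unique() { m_state.store(0); }

private:
    std::atomic<int> m_state{0};
};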

Additional context

No response

@ianrgraham ianrgraham added the enhancement New feature or request label Jan 13, 2023
@joaander
Member

Yes, running multiple simultaneous simulations within a single HOOMD-blue script is something I would like to support.

In addition to the excellent points you bring up, one also needs to reacquire the GIL every time HOOMD may use the Python interpreter. Some of these events include:

  • Messenger output when openPython has been called.
  • When calling custom triggers and actions.
  • When the log writer is invoked.
  • When accessing snapshot or other per-particle quantities that return numpy arrays.

The pybind11 docs state:

pybind11 will ensure that the GIL is held when it knows that it is calling Python code.

(https://pybind11.readthedocs.io/en/stable/advanced/misc.html#global-interpreter-lock-gil), but I don't know how much we can rely on that.

The alternative solution is to selectively release the GIL only in known expensive C++-only kernel loops. This solution modifies more code, but the failure mode is a lack of performance scaling instead of undefined behavior / segmentation faults.
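
For example, a minimal pybind11 sketch of the selective-release approach, where run_kernels is a placeholder for a C++-only hot loop:

#include <pybind11/pybind11.h>
namespace py = pybind11;

// Pure C++ work that never touches the Python API.
void run_kernels(unsigned long long steps)
{
    volatile double x = 0;
    for (unsigned long long i = 0; i < steps; ++i)
        x += 1.0; // stand-in for the expensive kernel loop
}

void run(unsigned long long steps)
{
    py::gil_scoped_release release; // other Python threads may run meanwhile
    run_kernels(steps);
}                                   // GIL reacquired when `release` is destroyed

PYBIND11_MODULE(example, m)
{
    m.def("run", &run);
    // Equivalent: m.def("run", &run_kernels, py::call_guard<py::gil_scoped_release>());
}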

Some undefined behavior can be mitigated by (easiest) ensuring in Python that objects are attached to at most one simulation

I'm certain we already do this. It is an error to add an Operation instance to more than one Simulation.

@joaander
Member

We additionally need to consider GPU synchronization more carefully to ensure that users' threads are accessing the data they intend and not data from previous steps while asynchronous kernels are still in flight. Also, each independent simulation should be run in a separate CUDA stream so that there is the possibility for kernels to overlap. If this worked, it could enable better throughput when batching many small simulations on one GPU.

@ianrgraham
Contributor Author

(https://pybind11.readthedocs.io/en/stable/advanced/misc.html#global-interpreter-lock-gil), but I don't know how much we can rely on that.

I believe that it can be relied on. The custom triggers use PYBIND11_OVERLOAD_PURE, so that should be safe. The gotchas that they mention appear to be a good guide.

The alternative solution is to selectively release the GIL only in known expensive C++-only kernel loops. This solution modifies more code, but the failure mode is a lack of performance scaling instead of undefined behavior / segmentation faults.

I agree, but I hope that wouldn't need to be the case. I think the API surface that goes from C++->Python is small enough that it can be fully understood and encapsulated. I think code with numpy arrays, like you mention, will need attention (various properties and accessors), and so will classes that hold onto pybind11::objects (PythonAnalyzer, PythonTuner, PythonUpdater).

Messenger.openPython appears to already acquire the GIL; at least, that is mentioned in a doc comment.
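
The pattern those C++->Python crossings would need is roughly the following; the class and member names are placeholders, not the actual hoomd classes:

#include <pybind11/pybind11.h>
#include <cstdint>
#include <utility>

namespace py = pybind11;

// Placeholder for a class like PythonUpdater that holds a py::object and is
// invoked from a run loop that no longer holds the GIL.
class PythonCallbackHolder
{
public:
    explicit PythonCallbackHolder(py::object callback) : m_callback(std::move(callback)) { }

    void invoke(std::uint64_t timestep)
    {
        // Take the GIL before touching any py::object; it is released again
        // when `acquire` goes out of scope.
        py::gil_scoped_acquire acquire;
        m_callback(timestep);
    }

private:
    py::object m_callback;
};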

I'm certain we already do this. It is an error to add an Operation instance to more than one Simulation.

I think Device is the only object I've noticed that can exist across simulations. It would be a breaking change to disallow that. Maybe doing a copy.deepcopy on it when it's passed to a Simulation would be a fix? Or make it thread-safe?

We additionally need to consider GPU synchronization more carefully to ensure that users' threads are accessing the data they intend and not data from previous steps while asynchronous kernels are still in flight. Also, each independent simulation should be run in a separate CUDA stream so that there is the possibility for kernels to overlap. If this worked, it could enable better throughput when batching many small simulations on one GPU.

Yea I was gonna add that, but I wanted to get y'all onboard first haha. That would require us to add a stream parameter to every GPU driver, a bit of work, but we might be able to regex-match and replace for the most part. I'd assume we would add an additional parameter to device initialization like device = hoomd.device.GPU(stream=1). You could also have this happen automatically, but that would require some global state in the hoomd module, yuck.

I'm happy that there is interest here! I'll get started on a draft PR.

@ianrgraham
Contributor Author

ianrgraham commented Jan 13, 2023

We additionally need to consider GPU synchronization more carefully to ensure that users' threads are accessing the data they intend and not data from previous steps while asynchronous kernels are still in flight.

I've found that a hipDeviceSynchronize() followed by std::barrier::arrive_and_wait() (C++20) is enough to synchronize threads. None of the current hoomd code requires this, but anything new that crosses a thread boundary needs one before and one after.
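
Roughly, in HIP terms (error handling omitted, function name made up):

#include <hip/hip_runtime.h>
#include <barrier>

// Flush this thread's outstanding GPU work, then meet the other simulation
// threads at a barrier before and after touching shared data.
void exchange_at_boundary(std::barrier<>& sync)
{
    hipDeviceSynchronize();  // all kernels issued by this thread have finished
    sync.arrive_and_wait();  // every participating thread is at the boundary

    // ... read/write data shared between simulations here ...

    sync.arrive_and_wait();  // nobody resumes until the exchange is complete
}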

@joaander
Member

I would prefer to make Device thread-safe. While we can break APIs in major releases (I'm planning a v4 release soon), I would prefer to allow sharing devices. In practice I would personally use separate devices for parallel simulations. Each Device would have a different msg_file to separate the output streams. But I don't think we necessarily need to force this on all users.

As you say, this complicates how parallel streams would be assigned, but let's not get too far ahead. I'm not 100% certain that this is feasible. It is something I have had in mind since day 1 with HOOMD: to enable better use of the parallel resources on the GPU for small simulations. However, the early generations of GPUs were not capable of concurrent kernels or had severe limitations on what could run concurrently. I have not recently reviewed the limitations on the latest GPUs.

@ianrgraham
Contributor Author

Understood, though you would still imagine at most one stream per simulation, right? If that's all, I don't think the streams addition is that complicated code-wise. Just that if you want stream partitioning to happen automatically, the hoomd module would need to keep a count of in-use streams, and a list of any released-and-open streams.

I don't think the streams refactor is essential to the GIL release being useful, so that can be discussed more and maybe turned into a separate PR. Like you said, the biggest benefit would come from many small sims run in parallel, masking the overhead of kernel launches.

@joaander
Member

Understood, though you would still imagine at most one stream per simulation, right? If that's all, I don't think the streams addition is that complicated code-wise. Just that if you want stream partitioning to happen automatically, the hoomd module would need to keep a count of in-use streams, and a list of any released-and-open streams.

Yes, exactly. Conceptually, the stream should be owned by the simulation, not the device, so that all kernels for that simulation are processed in order. However, the Simulation class doesn't exist at the C++ level. The one simulation-specific class that all others have access to is ParticleData. So, at the C++ level, we could:

  • Create (hipStreamCreate) the primary stream in the ParticleData class to allow Device sharing using m_sysdef->getStream()
  • Or, we could create the stream in Device and require that users instantiate multiple devices to get parallel execution on the GPU.

The CUDA/HIP runtime API handles creating a new stream each time streamCreate is called, so we don't need to manually handle global state.
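
As a sketch of the first option, with a stand-in class owning the stream the way ParticleData (or ExecutionConfiguration) would; names and details here are illustrative, not existing HOOMD code:

#include <hip/hip_runtime.h>

// Stand-in for the class that would own the per-simulation stream.
class StreamOwner
{
public:
    StreamOwner()
    {
        hipStreamCreate(&m_stream); // each call yields a distinct new stream
    }

    ~StreamOwner()
    {
        hipStreamDestroy(m_stream);
    }

    hipStream_t getStream() const
    {
        return m_stream;
    }

private:
    hipStream_t m_stream;
};

// Kernel launches would then pass the owned stream, e.g.:
//   hipLaunchKernelGGL(some_kernel, grid, threads, 0, owner.getStream(), args...);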

I don't think the streams refactor is essential to the GIL release being useful, so that can be discussed more and maybe turned into a separate PR. Like you said, the biggest benefit would come from many small sims run in parallel, masking the overhead of kernel launches.

Yes, of course. The PRs for GIL and streams should be separate. I am planning on replacing all the HIP calls with another interface layer in #1063. Since I will be touching all the kernel calls, I could add streams at that time.

@ianrgraham
Contributor Author

ianrgraham commented Jan 13, 2023

Aaahh, I hadn't noticed that when looking at the stream API; super simple! I think either model is fine, though naturally I would have thought the stream belongs with the Device (and really as a member of ExecutionConfiguration, since other GPU details are stored there), but the SystemDefinition or ParticleData classes work as well.

I don't know if there are any cases where you would want everything to be forced into the same stream, but if so that is a reason to opt for keeping those details with Device.

@joaander
Member

I don't know if there are any cases where you would want everything to be forced into the same stream, but if so that is a reason to opt for keeping those details with Device.

Yes, at the C++ level it is more natural to put the stream in ExecutionConfiguration (Device in Python).

Regarding the GIL, the pybind11 docs suggest testing in Debug mode. I do have one job in the CI matrix doing just this: https://github.com/glotzerlab/hoomd-blue/actions/runs/3904935986/jobs/6672203929 . That should help catch some issues with the GIL.

@ianrgraham
Contributor Author

Okay cool! I'll get separate PRs going for these!

@b-butler
Member

b-butler commented Jan 16, 2023

@ianrgraham it may be worth waiting on doing this. Python 3.12 will have a compile time option to remove the GIL entirely. In even further releases it may be removed by default. I think that this is the easiest and most promising approach as it is done by the CPython developers directly.

edit: linking PEP https://peps.python.org/pep-0703/

@ianrgraham
Contributor Author

Ahh, well that's cool to hear! But that's still quite a ways away, probably an October 2023 release? I don't mind spending some time with this since I need the behavior for some of my plugins to run efficiently.

I can keep the no-GIL changes in my personal fork, but would you all mind if I aim to merge them? I don't intend to expose the "no-GIL" path as the default, just as an option that should be really hard for novice users to notice and use. Possibly as a flag on the Simulation object (Simulation.__gil = False), or a function on the _cpp_obj that isn't neatly wrapped (Simulation._cpp_obj._run_no_gil(steps)).

@ianrgraham
Contributor Author

Separately, I've made quite a bit of progress with refactoring a unique stream that is attached to the ExecutionConfiguration. One thing I'm intending to ignore here, though, is that there are a couple of kernels that already have managed streams that appear to span multiple GPUs, like the depletant and Tree NList codes. I'm planning to just leave these cases alone, but basically any kernel that is currently submitting to the default stream will be updated to submit to ExecutionConfiguration->getStream().

@joaander
Member

Separately, I've made quite a bit of progress with refactoring a unique stream that is attached to the ExecutionConfiguration. One thing I'm intending to ignore here, though, is that there are a couple of kernels that already have managed streams that appear to span multiple GPUs, like the depletant and Tree NList codes. I'm planning to just leave these cases alone, but basically any kernel that is currently submitting to the default stream will be updated to submit to ExecutionConfiguration->getStream().

Have you profiled with Nsight systems and verified that you get parallel execution on the separate streams?

@ianrgraham
Contributor Author

Have you profiled with Nsight systems and verified that you get parallel execution on the separate streams?

No not yet, but I certainly will. A couple more regex-replaces before I'm there.

@joaander
Member

I can keep the no-GIL changes in my personal fork, but would you all mind if I aim to merge them? I don't intend to expose the "no-GIL" path as the default, just as an option that should be really hard for novice users to notice and use. Possibly as a flag on the Simulation object (Simulation.__gil = False), or a function on the _cpp_obj that isn't neatly wrapped (Simulation._cpp_obj._run_no_gil(steps)).

I strongly prefer not to merge experimental code, even when activated as an opt-in. Doing so introduces maintenance challenges for many years to come. Most Python packages (e.g. numpy) release the GIL without any user intervention; HOOMD-blue should do so as well if possible.

I'd be happy to merge your changes that release the GIL if they do so by default and you have tested and are reasonably sure that the GIL is reacquired when needed. The release can come with a warning that multithreaded simulations are experimental, but just releasing the GIL can enable workflows that don't involve that (e.g. analysis in parallel with simulation).

@ianrgraham
Contributor Author

ianrgraham commented Jan 17, 2023

I'd be happy to merge your changes that release the GIL if they do so by default and you have tested and are reasonably sure that the GIL is reacquired when needed.

Of course, I'll do my best to ensure everything is airtight.

I had one other question for the both of you, regarding existing usage of streams in HOOMD. Would you prefer that I not refactor these existing cases that use a GPUPartition and streams, like these kernel drivers?

void generate_num_depletants(const uint16_t seed,
                             const uint64_t timestep,
                             const unsigned int select,
                             const unsigned int rank,
                             const unsigned int depletant_type_a,
                             const Scalar* d_lambda,
                             const Scalar4* d_postype,
                             unsigned int* d_n_depletants,
                             const unsigned int block_size,
                             const hipStream_t* streams,
                             const GPUPartition& gpu_partition,
                             const unsigned int ntypes);

void get_max_num_depletants(unsigned int* d_n_depletants,
                            unsigned int* max_n_depletants,
                            const hipStream_t* streams,
                            const GPUPartition& gpu_partition,
                            CachedAllocator& alloc);

Is there an intention down the road to phase out that code? Since plain old MPI is good enough to scale HOOMD to multi-GPU systems?

At the moment I have left these kernels alone, and there doesn't seem to be any harm in them running on a different stream, since that was already happening in the code before.

Another case where I've noticed the usage of streams is in NListTreeGPU, but that's because you are leveraging external code for the BVH.

@ianrgraham
Contributor Author

I've also found some places where kernel drivers are not in a kernel namespace; am I free to fix that?

@joaander
Member

Both the depletants code and NListTreeGPU are running streams per type so that kernels issued once per particle type can run in parallel. They should continue to use the streams they create after your changes.

  • NlistTreeGPU already synchronizes with the default stream via hipDeviceSynchronize.
  • In the depletants code, you will need to add a hipStreamSynchronize(m_exec_conf->getStream()) before the asynchronous operations are issued using the m_depletant_streams* streams.

Feel free to fix any kernel namespace issues you find.

Sorry I didn't see this earlier, but take a look at: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#default-stream
Maybe setting CUDA_API_PER_THREAD_DEFAULT_STREAM=1 is an alternative to the changes you are making?
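
As a sketch (untested on my end): defining the macro before the CUDA runtime headers are included, or passing --default-stream per-thread to nvcc, opts into per-thread default streams:

// Must appear before the first include of the CUDA runtime headers; the
// nvcc flag --default-stream per-thread has the same effect.
#define CUDA_API_PER_THREAD_DEFAULT_STREAM 1
#include <cuda_runtime.h>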

@ianrgraham
Contributor Author

Got it.

Oh and it's no problem. That may be a better solution, since it might also apply to thrust operations. That was an issue I was anticipating having to deal with, since they are submitting to the default stream. I'll test out that flag and what I currently have.

@vyasr
Contributor

vyasr commented Jan 21, 2023

@ianrgraham it may be worth waiting on doing this. Python 3.12 will have a compile time option to remove the GIL entirely. In even further releases it may be removed by default. I think that this is the easiest and most promising approach as it is done by the CPython developers directly.

edit: linking PEP https://peps.python.org/pep-0703/

I would note on the GIL front that, while that work is very exciting, there are still a lot of concerns to be hammered out. Based on the ongoing discussions, I would not be at all surprised by any of the following happening (or not happening):

  1. nogil slipping past Python 3.12
  2. Both gil and nogil Python builds persisting in parallel for multiple Python releases
  3. Various other Python packages containing extension modules not supporting nogil mode for an extended period of time (the author of PEP 703 has worked on porting and has generally found it painless, but I can't imagine that there won't be exceptions)
  4. nogil requiring a Python major version bump (i.e. Python 4.0)

Any of the above would delay its usability for HOOMD, potentially significantly. Moreover, removing the GIL isn't a magic button that will suddenly make the code thread-safe, so that work needs to be done regardless. Given how tight HOOMD 3.0's Python-C++ integration is, there are probably quite a few places to think about how HOOMD uses Python (pybind) objects in C++ to ensure that they're all thread-safe. Once the core HOOMD code is thread-safe, actually releasing the GIL is pretty straightforward with pybind. As such, if getting multithreaded HOOMD support is of sufficient interest, I wouldn't advise waiting on GIL removal for that.

@ianrgraham
Contributor Author

ianrgraham commented Jan 22, 2023

My intention here is not to release the GIL to make a single simulation multi-threaded, but rather to be able to run many simulations concurrently in one Python script and use my own locks to orchestrate reads and writes across systems. So users can't create undefined behavior unless they were to write some C++ plugin code that purposely shared data across threads and didn't orchestrate it properly.

Trying to make all of hoomd thread-safe at a fine-grained level would be a heck of a lot of work. Though honestly, if one simply wants some thread-safety, a not-so-terrible solution might be to just add a std::shared_mutex m_lock member to SystemDefinition. If you want to get some information from another simulation thread, or modify its data, you have to acquire the lock first as a shared_lock or unique_lock, respectively. It's not totally airtight (this is C++ after all), and it's overly restrictive in cases where a simulation may want to share some data as mutable and another bit of data as immutable, but it's not too bad.
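
Something like this minimal sketch is what I have in mind (with a stand-in class rather than the real SystemDefinition):

#include <mutex>
#include <shared_mutex>

// Stand-in for SystemDefinition with a coarse-grained lock.
class SystemDefinitionLike
{
public:
    std::shared_mutex m_lock;
    // ... particle data, box, etc. ...
};

// Reading another simulation's data: many concurrent readers are allowed.
void read_from(SystemDefinitionLike& other)
{
    std::shared_lock<std::shared_mutex> guard(other.m_lock);
    // ... inspect or copy other's data ...
}

// Modifying another simulation's data: exclusive access.
void write_to(SystemDefinitionLike& other)
{
    std::unique_lock<std::shared_mutex> guard(other.m_lock);
    // ... modify other's data ...
}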

And it actually wasn't too bad to patch all the holes that are made by releasing the GIL during calls to run. The fact that pybind11 inserts GIL assertions in Debug mode made this really easy. While hoomd has many Python->C++ calls, there are only a few C++->Python ones; just Messenger, CustomForceCompute, ParticleFilterCustom, PythonShapeMove, GSDDumpWriter, PythonUpdater, PythonTuner, and PythonAnalyzer have a method or two each that needed wrapping. Additionally there is TriggerPy, VariantPy, and HalfStepHookPy, but these are already safe because they all implicitly acquire the GIL since their calls are wrapped with PYBIND11_OVERLOAD_*.

@joaander
Member

Indeed, I would have no intention to provide any type of thread safety within a single Simulation object.
To cover the analysis thread use-case, users should use global snapshots, or copies of local snapshots. Users should only construct, set parameters, access, and run simulations from one thread. We can document this in new threading model documentation. We would need a large amount of work to enable threaded access to a single simulation and I see no important use-cases that it would enable.

Advanced sampling techniques do not require threading. You can use MPI partitions and mpi4py to orchestrate independent runs. The only new use-case I see for threading multiple simulations is the possibility of running multiple smaller simulations on a single GPU. This has never been efficient with separate processes sharing a GPU, and it remains to be seen if it is efficient with threads.

@vyasr
Contributor

vyasr commented Jan 28, 2023

That makes sense; I'm fine with that limited scope. My only point was to say that even if Python removes the GIL (or makes it possible to turn off the GIL), it won't actually make the code thread-safe, and there's no indication that a nogil Python will be the default any time soon, so I would proceed with making any necessary changes to HOOMD now. In a future with a nogil Python, the only change would then be to remove the pybind code for releasing the GIL since that would become redundant.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@github-actions github-actions bot added the stale There has been no activity on this for some time. label Oct 15, 2023
@ianrgraham
Contributor Author

I'll put together the PR for releasing the GIL soon. That part was completed a while ago. The overhaul of the kernel launches to independent CUDA streams I'm less confident about, so I'd omit that for the sake of just getting the primary feature in.

@github-actions github-actions bot removed the stale There has been no activity on this for some time. label Oct 21, 2023