Memory growth with GPU-aware MPICH on Intel PVC GPUs #6959

Open
zippylab opened this issue Apr 3, 2024 · 3 comments

Comments

zippylab commented Apr 3, 2024

Our application XGC has conditional coding for GPU-aware MPI, which has been working correctly on some systems such as Perlmutter with NVIDIA A100 GPUs (cray-mpich/8.1.28) and Frontier with AMD MI250X GPUs (cray-mpich).

Testing this on the Sunspot testbed at Argonne with Intel PVC GPUs (Aurora MPICH: mpich/icc-all-pmix-gpu/52.2), I observe uncontrolled memory growth that appears to stem from an MPI_Alltoallv() with large message sizes (O(GB)). The Aurora MPICH developers at Intel asked me to create a ticket here and provide them the ticket number.

The output below shows memory usage queried at several time steps of the test run; GPU usage keeps growing until the run eventually runs out of GPU memory:

Step 1:
CPU memory usage at the beginning of time step: Min/Avg/Max used = 47.39/47.39/47.39GB (1134.38GB total available), min=0, max=0
GPU memory usage at the beginning of time step: Min/Avg/Max used = 11.94/11.94/11.94GB (64.00GB total available), min=0, max=1
…
Step 5:
CPU memory usage at the beginning of time step: Min/Avg/Max used = 56.27/56.27/56.27GB (1134.38GB total available), min=0, max=0
GPU memory usage at the beginning of time step: Min/Avg/Max used = 35.79/35.79/35.79GB (64.00GB total available), min=0, max=1
…
Step 10:
CPU memory usage at the beginning of time step: Min/Avg/Max used = 56.24/56.24/56.24GB (1134.38GB total available), min=0, max=0
GPU memory usage at the beginning of time step: Min/Avg/Max used = 60.71/60.71/60.71GB (64.00GB total available), min=0, max=1
…
x1921c0s6b0n0.hostmgmt2000.cm.americas.sgi.com 1: terminate called after throwing an instance of 'std::runtime_error'
  what():  Kokkos failed to allocate memory for label "sendbuf".  Allocation using MemorySpace named "SYCLDeviceUSM" failed with the following error:  Allocation of size 2.067 G failed because of an unknown error.  (The allocation mechanism was sycl::malloc_device().)
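
To put rough numbers on the growth, the GPU figures in the log can be turned into a per-step rate. The following is only a sketch of that arithmetic: the names `Sample`, `growth_per_step`, `headroom_gb`, and `steps_until_full` are hypothetical helpers invented for this illustration and are not part of XGC, Kokkos, or MPICH. It also assumes growth is linear between the reported steps, which the log does not confirm.

```cpp
#include <cmath>

// Hypothetical helper type: one line of the log above
// (time step number and GPU memory in use, in GB as printed).
struct Sample {
    int step;
    double used_gb;
};

// Average growth in GB per time step between two log samples,
// assuming linear growth between them.
double growth_per_step(const Sample& a, const Sample& b) {
    return (b.used_gb - a.used_gb) / static_cast<double>(b.step - a.step);
}

// GPU memory still free at a given sample.
double headroom_gb(const Sample& s, double total_gb) {
    return total_gb - s.used_gb;
}

// Number of further steps after which usage would reach or exceed total_gb
// if it kept growing at rate_gb_per_step. Returns -1 if there is no growth.
int steps_until_full(const Sample& s, double total_gb, double rate_gb_per_step) {
    if (rate_gb_per_step <= 0.0) {
        return -1;
    }
    return static_cast<int>(std::ceil(headroom_gb(s, total_gb) / rate_gb_per_step));
}
```

With the values from the log (11.94 GB at step 1, 35.79 GB at step 5, 60.71 GB at step 10, 64.00 GB total), `growth_per_step` gives about 5.96 GB per step between steps 1 and 5 and about 4.98 GB per step between steps 5 and 10. `headroom_gb` at step 10 is 3.29 GB, so at that rate `steps_until_full` returns 1: one more step would be enough to exceed the device memory.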

abrooks98 (Contributor) commented

@zippylab do you have a small reproducer you can share that mimics the workload and the described issue? We believe we understand the cause, but we need to be able to validate the solution.

zippylab commented Apr 4, 2024

@abrooks98 I don't have a small reproducer yet. The code where we observed it is pretty deep down in XGC and involves a number of template instances as well as Kokkos views of more than one variety, including unmanaged views. Constructing something simple to demonstrate it may take quite a bit of trial and error. I'll start working on it, but meanwhile it may be that @zhenggb72, one of the Intel people I've been working with on this, could help validate the solution using XGC.

zhenggb72 (Contributor) commented

@zippylab Alex is working with me on this. We think we have a fix for this issue, and we would like a reproducer to test it with.
