Memory growth with GPU-aware MPICH on Intel PVC GPUs #6959

zippylab · 2024-04-03T18:05:14Z

Our application XGC has conditional coding for GPU-aware MPI, which has been working correctly on some systems such as Perlmutter with NVIDIA A100 GPUs (cray-mpich/8.1.28) and Frontier with AMD MI250X GPUs (cray-mpich).

Testing this on the Sunspot testbed at Argonne using Intel PVC GPUs (Aurora MPICH: mpich/icc-all-pmix-gpu/52.2), I observe uncontrolled memory growth, apparently stemming from an MPI_Alltoallv() with large message sizes (O(GB)). The Aurora MPICH developers at Intel asked me to create a ticket here and provide them the ticket number.

This output shows memory usage queries at various timesteps in the test run, eventually leading to running out of GPU memory:

Step 1:
CPU memory usage at the beginning of time step: Min/Avg/Max used = 47.39/47.39/47.39GB (1134.38GB total available), min=0, max=0
GPU memory usage at the beginning of time step: Min/Avg/Max used = 11.94/11.94/11.94GB (64.00GB total available), min=0, max=1
…
Step 5:
CPU memory usage at the beginning of time step: Min/Avg/Max used = 56.27/56.27/56.27GB (1134.38GB total available), min=0, max=0
GPU memory usage at the beginning of time step: Min/Avg/Max used = 35.79/35.79/35.79GB (64.00GB total available), min=0, max=1
…
Step 10:
CPU memory usage at the beginning of time step: Min/Avg/Max used = 56.24/56.24/56.24GB (1134.38GB total available), min=0, max=0
GPU memory usage at the beginning of time step: Min/Avg/Max used = 60.71/60.71/60.71GB (64.00GB total available), min=0, max=1
…
x1921c0s6b0n0.hostmgmt2000.cm.americas.sgi.com 1: terminate called after throwing an instance of 'std::runtime_error'
  what():  Kokkos failed to allocate memory for label "sendbuf".  Allocation using MemorySpace named "SYCLDeviceUSM" failed with the following error:  Allocation of size 2.067 G failed because of an unknown error.  (The allocation mechanism was sycl::malloc_device().)
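
As a rough check on the figures above, the GPU growth rate can be worked out from the three reported samples. This is a minimal sketch, assuming the logged values are representative of all ranks (min, avg, and max are identical in the log); the variable and function names are illustrative and not part of XGC or MPICH:

```python
# GPU memory in use (GB) at the start of each reported time step,
# copied from the log above. Steps between the samples were not reported.
gpu_used_gb = {1: 11.94, 5: 35.79, 10: 60.71}
gpu_total_gb = 64.00


def growth_per_step(samples):
    """Return (first_step, last_step, GB per step) for each adjacent pair of samples."""
    steps = sorted(samples)
    return [
        (a, b, (samples[b] - samples[a]) / (b - a))
        for a, b in zip(steps, steps[1:])
    ]


rates = growth_per_step(gpu_used_gb)
for a, b, rate in rates:
    print(f"steps {a}-{b}: {rate:.2f} GB/step")

first, last = min(gpu_used_gb), max(gpu_used_gb)
overall = (gpu_used_gb[last] - gpu_used_gb[first]) / (last - first)
headroom = gpu_total_gb - gpu_used_gb[last]
print(f"overall: {overall:.2f} GB/step, headroom at step {last}: {headroom:.2f} GB")
```

By this estimate the GPU usage grows by roughly 5 to 6 GB per time step, while CPU usage stays flat after step 5, so only a few GB of headroom remain by step 10.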


abrooks98 · 2024-04-04T20:32:48Z

@zippylab do you have a small reproducer you can share that mimics the workload and the described issue? We believe we understand the cause, but we need to be able to validate the solution.

zippylab · 2024-04-04T22:47:38Z

@abrooks98 I don't have a small reproducer yet. Where we observed it is pretty deep down in XGC functionality, and involves a number of template instances as well as Kokkos views of more than one variety including unmanaged views. Constructing something simple to demonstrate it may take quite a bit of trial-and-error. I'll start working on it, but meanwhile it may be that @zhenggb72, one of the Intel people I've been working on this with, could help with validating the solution using XGC.

zhenggb72 · 2024-04-05T01:45:11Z

@zippylab Alex is working with me on this. We think we have a fix for this issue, and we would like a reproducer to test it out.
