
MPICH CH4:UCX with CUDA support (3.3.x)

Yanfei Guo edited this page May 13, 2022 · 2 revisions

Preliminary how-to for running CUDA applications with MPICH using CH4:UCX / HCOLL with CUDA support.

Note: To enable CUDA-aware reduction collectives, you need to build and run with the HCOLL collectives library.

  • Required software

    • MPICH v3.3
    • UCX v1.3
    • HPC-X v2.1 (includes HCOLL 4.0)
    • CUDA v8.0 or higher - refer to NVIDIA documents for CUDA Toolkit installation
    • Verify that the GPUDirect RDMA kernel module is properly loaded on each compute node where you plan to run jobs that require GPUDirect RDMA.
      To check whether the GPUDirect RDMA module is loaded, run: service nv_peer_mem status (the service also accepts start, stop, and restart)
      On other Linux flavors, run: lsmod | grep nv_peer_mem
    • GDR COPY plugin module: GDR COPY is a fast copy library from NVIDIA, used to transfer data between host and GPU memory. For information on how to install GDR COPY, refer to its GitHub webpage.
    • To make sure the gdrcopy kernel module is loaded on each compute node, use: /etc/init.d/gdrcopy {start|stop|restart}
    • To check whether the GDR COPY module is loaded, run: lsmod | grep gdrdrv
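The two module checks above can be combined into one script to run on every compute node. This is a minimal sketch, not part of the original instructions: the function names are hypothetical, and it assumes the kernel modules are named nv_peer_mem and gdrdrv as they appear in lsmod output.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if module $1 is listed in lsmod-style
# text read from stdin (module name is the first column).
module_listed() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Hypothetical helper: reports the state of both modules this guide relies on.
# Assumes the module names nv_peer_mem (GPUDirect RDMA) and gdrdrv (GDR COPY).
check_gpu_modules() {
    snapshot=$(cat)
    rc=0
    for mod in nv_peer_mem gdrdrv; do
        if printf '%s\n' "$snapshot" | module_listed "$mod"; then
            echo "$mod: loaded"
        else
            echo "$mod: NOT loaded"
            rc=1
        fi
    done
    return $rc
}

# Usage on a compute node:
#   lsmod | check_gpu_modules
```

The exact match on the first column avoids false positives from modules whose names merely contain the search string, which a plain grep would accept.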
  • Building UCX

git clone https://github.com/openucx/ucx.git
cd ucx
./autogen.sh
# Set --with-cuda to the appropriate CUDA path for your system
./contrib/configure-release --prefix=<PATH> --with-gdrcopy --with-cuda=/usr/local/cuda
make && make install
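After installing, it can be worth confirming that the UCX build actually picked up the CUDA and GDR COPY options. The sketch below assumes that ucx_info -v prints the configure line of the build; the helper name ucx_has_flags is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads `ucx_info -v` style text on stdin and reports
# whether each flag given as an argument appears anywhere in it.
ucx_has_flags() {
    info=$(cat)
    missing=0
    for flag in "$@"; do
        case "$info" in
            *"$flag"*) echo "found:   $flag" ;;
            *)         echo "missing: $flag"; missing=1 ;;
        esac
    done
    return $missing
}

# Usage (replace <PATH> with the UCX install prefix):
#   <PATH>/bin/ucx_info -v | ucx_has_flags --with-cuda --with-gdrcopy
```

A non-zero return status means at least one flag was not found, so the check can gate a later build step in a script.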
  • Building MPICH
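git clone https://github.com/pmodels/mpich.git
cd mpich
./autogen.sh   # if the checkout uses submodules, run "git submodule update --init" first
# To build with the Mellanox HCOLL collectives library, also pass: --with-hcoll=<PATH to CUDA aware HCOLL>
./configure --prefix=<PATH> --with-device=ch4:ucx --with-ucx=<PATH to CUDA aware UCX>
make && make install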
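To confirm that the installed MPICH uses the intended device, the output of mpichversion can be inspected. The following is a small sketch that assumes mpichversion prints a line starting with "MPICH Device:"; the helper name mpich_device is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: prints the device named on the "MPICH Device:" line
# of mpichversion-style text read from stdin.
mpich_device() {
    sed -n 's/^MPICH Device:[[:space:]]*//p'
}

# Usage (replace <PATH> with the MPICH install prefix):
#   [ "$(<PATH>/bin/mpichversion | mpich_device)" = "ch4:ucx" ] && echo "CH4:UCX build"
```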
  • Building OSU-Benchmark with CUDA
./configure --prefix=<PATH> CC=<PATH to mpicc> CXX=<PATH to mpicxx> --enable-cuda=basic --with-cuda=/usr/local/cuda
  • Example run command
mpirun -np 2 -map-by node -host ibm-p9-012:1,ibm-p9-013:1 -env UCX_NET_DEVICES=mlx5_0:1 -env MPIR_CVAR_ENABLE_HCOLL=1 -env LD_LIBRARY_PATH=$LD_LIBRARY_PATH ./install-mpich-cuda/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce -m 4:2048
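The example command bundles several site-specific values: the host list, the UCX network device, and the HCOLL switch. The sketch below breaks them out into variables and prints the resulting command for review without running it. All variable and function names are illustrative, and the defaults simply mirror the example above.

```shell
#!/bin/sh
# Illustrative variables; the defaults mirror the example run command.
HOSTS=${HOSTS:-ibm-p9-012:1,ibm-p9-013:1}   # one process on each of two nodes
NET_DEVICE=${NET_DEVICE:-mlx5_0:1}          # value passed as UCX_NET_DEVICES
USE_HCOLL=${USE_HCOLL:-1}                   # value passed as MPIR_CVAR_ENABLE_HCOLL
BENCH=${BENCH:-./install-mpich-cuda/libexec/osu-micro-benchmarks/mpi/collective/osu_allreduce}

# Hypothetical helper: prints the assembled mpirun command on one line.
build_run_cmd() {
    printf 'mpirun -np 2 -map-by node -host %s' "$HOSTS"
    printf ' -env UCX_NET_DEVICES=%s' "$NET_DEVICE"
    printf ' -env MPIR_CVAR_ENABLE_HCOLL=%s' "$USE_HCOLL"
    printf ' -env LD_LIBRARY_PATH=%s' "${LD_LIBRARY_PATH:-}"
    printf ' %s -m 4:2048\n' "$BENCH"
}

build_run_cmd   # prints the command only; it does not launch the job
```

Overriding a variable on the command line, for example USE_HCOLL=0, prints the variant of the command with that setting changed.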