Using MPICH on Aurora@ALCF

Ken Raffenetti edited this page Mar 5, 2024 · 10 revisions

This page describes how to build and use MPICH on the 'Aurora' machine at Argonne. Aurora uses Intel CPUs and GPUs with the Cray Slingshot interconnect. Support for Slingshot is provided via the system libfabric.

Prerequisite

As of 03/01/2024, it is best to build the MPICH git main branch for use on Aurora.

Build MPICH

Build ZE-enabled MPICH for use with the Cray PALS launcher

./autogen.sh
./configure --prefix=/home/raffenet/proj/mpich/i --with-device=ch4:ofi --with-libfabric=/opt/cray/libfabric/1.15.2.0 --with-ze --with-pm=no --with-pmi=pmix --with-pmix=/usr CC=icx CXX=icpx FC=ifx
make -j16 install

A correctly configured MPICH build should print the following summary in the configure output:

*****************************************************
***
*** device      : ch4:ofi
*** shm feature : auto
*** gpu support : ZE
***
*****************************************************
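
One way to check for this summary without scrolling through the output is to save the configure output to a file and search it for the device and GPU lines. The following is a minimal sketch under that assumption: the log file name configure.log and the helper name check_mpich_summary are hypothetical and not part of MPICH.

```shell
# Hypothetical helper: succeeds only if the saved configure output
# reports the ch4:ofi device and ZE GPU support.
check_mpich_summary() {
    log="$1"
    grep -q 'device *: ch4:ofi' "$log" &&
        grep -q 'gpu support *: ZE' "$log"
}

# Example use, assuming the configure output was saved to configure.log:
# check_mpich_summary configure.log && echo "ZE-enabled ch4:ofi build"
```

If the helper fails, the configure options (in particular --with-device and --with-ze) are the first thing to re-check.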

Running MPI Applications

On Aurora, MPICH is configured without a process manager (--with-pm=no). Instead, use the mpiexec that is available as part of Cray's PALS package. Setting PALS_PMI=pmix in the execution environment is required so that processes can query the runtime for job information. When launching all processes on a single host, you must also set MPIR_CVAR_SINGLE_HOST_ENABLED=0 in the environment.
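
The environment settings described above might be combined into a launch script along these lines. This is only a sketch: the application path ./my_mpi_app and the process count are hypothetical, and it assumes the PALS mpiexec accepts -n for the number of processes, so check the mpiexec documentation on the system before relying on it.

```shell
# Required so processes can query the PALS runtime for job information.
export PALS_PMI=pmix
# Needed only when every process runs on the same host.
export MPIR_CVAR_SINGLE_HOST_ENABLED=0

# Hypothetical application path and process count.
APP=./my_mpi_app
NPROCS=4

# Launch only if mpiexec and the application are actually present;
# otherwise just print the command that would be run.
if command -v mpiexec >/dev/null 2>&1 && [ -x "$APP" ]; then
    mpiexec -n "$NPROCS" "$APP"
else
    echo "would run: mpiexec -n $NPROCS $APP"
fi
```

For multi-node runs, the MPIR_CVAR_SINGLE_HOST_ENABLED line can be dropped, since the text above only requires it for single-host launches.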

Debugging

  • gdb: available on Aurora, but the executable is called gdb-oneapi rather than gdb
  • gdb4hpc: a pretty nice parallel wrapper around gdb, though it will probably time out if you have more than a few dozen processes

Common Issues

  • It's possible (maybe only with large counts) to trigger an unaligned memory error in the ZE code:
.../modules/yaksa/src/backend/ze/hooks/yaksuri_zei_type_hooks.c:101:30: runtime error: load of misaligned address 0xfbc50007672a3559 for type 'struct _ze_module_handle_t *', which requires 8 byte alignment
0xfbc50007672a3559: note: pointer points here
<memory cannot be printed>
Segmentation fault

If you don't need GPU processing, you can work around this by configuring with --without-ze instead of --with-ze.