mpi: multiple brokers per node with multiple MPI tasks per node hangs with mvapich2-2.3.7-intel #5912

Open
grondo opened this issue Apr 25, 2024 · 2 comments


grondo (Contributor) commented Apr 25, 2024

While trying to run some of the MPI tests under a test instance, I noticed that MPI bootstrap with mvapich2-2.3.7-intel hangs when there are multiple tasks per node:

$ flux run -N4 -n4 t/mpi/hello
f5Cm644CF: completed MPI_Init in 1.110s. There are 4 tasks
f5Cm644CF: completed first barrier in 0.028s
f5Cm644CF: completed MPI_Finalize in 0.041s
$ flux run -N4 -n8 t/mpi/hello
[hangs]

This is with v0.60.0. It also reproduces on current master, but instead of a hang, a task gets a Bus Error:

$ flux run -N4 -n8 t/mpi/hello
[corona212:mpi_rank_2][error_sighandler] Caught error: Bus error (signal 7)

The bus error occurs in MPIDI_CH3I_CM_SHMEM_Sync(). There are no details in the backtrace since debug symbols are not available.
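For reference, here is a rough sketch of the kind of setup involved. The node/broker/task counts and the use of flux start as a parallel job are illustrative assumptions about one way to get multiple brokers per node, not necessarily the exact test instance I was using:

# Illustrative only: bootstrap 4 brokers across 2 nodes (2 brokers per
# node) by running flux start as a 4-task job, then run the failing case
# as that subinstance's initial program.
$ flux run -N2 -n4 flux start flux run -N4 -n8 t/mpi/hello

With one task per broker (flux run -N4 -n4 t/mpi/hello inside the same kind of instance), the test completes as shown above.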

wihobbs (Member) commented Apr 25, 2024

Hmm, is it of concern that this didn't get caught in our GitLab CI? Our logs from last night with the intel-classic compiler and mvapich2 MPI show:

Running with intel-classic compiler and mvapich2 MPI
f28u15Vy
f28u15Vz
f28vV4nK
f28wy44f
f2bjg5Q3
f2bjg5Q4
f2bjg5Q5
f2bmA4gP
f2bjg5Q3: completed MPI_Init in 0.461s.  There are 4 tasks
f2bjg5Q3: completed first barrier in 0.000s
f2bjg5Q3: completed MPI_Finalize in 0.010s
Hello World from rank 1
Hello World from rank 0
Hello World from rank 3
Hello World from rank 2
MVAPICH2 Version      :	2.3.7
MVAPICH2 Release date :	Wed March 02 22:00:00 EST 2022
MVAPICH2 Device       :	ch3:mrail
MVAPICH2 configure    :	--prefix=/usr/tce/backend/installations/linux-rhel8-x86_64/intel-2021.6.0/mvapich2-2.3.7-2575ifqlr5fbj34wdlj2fo2tmqdrehia --enable-shared --enable-romio --disable-silent-rules --disable-new-dtags --enable-fortran=all --enable-threads=multiple --with-ch3-rank-bits=32 --enable-wrapper-rpath=yes --disable-alloca --enable-fast=all --disable-cuda --enable-registration-cache --with-pm=hydra --with-device=ch3:mrail --with-rdma=gen2 --disable-mcast --with-file-system=lustre+nfs+ufs --enable-llnl-site-specific-options --enable-debuginfo
MVAPICH2 CC           :	/usr/tce/spack/lib/spack/env/intel/icc    -DNDEBUG -DNVALGRIND -O2
MVAPICH2 CXX          :	/usr/tce/spack/lib/spack/env/intel/icpc   -DNDEBUG -DNVALGRIND -O2
MVAPICH2 F77          :	/usr/tce/spack/lib/spack/env/intel/ifort   -O2
MVAPICH2 FC           :	/usr/tce/spack/lib/spack/env/intel/ifort   -O2

That's off current master (or, rather, whatever master was at 3AM today).

If so, I should open a second issue on flux-test-collective to see why this didn't get caught.

grondo (Contributor, Author) commented Apr 25, 2024

Does the GitLab CI run in a multiple-brokers-per-node configuration? If not, I guess we could add that, since having that working would aid in testing.
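For example, an added case might look something like the following sketch (the counts, the t/mpi/hello path, and the flux start-as-a-job approach are all illustrative assumptions, not how the CI is currently wired up):

# Hypothetical extra CI case: 2 brokers per node on 2 nodes, then 2 MPI
# tasks per broker inside the resulting 4-broker subinstance.
$ flux run -N2 -n4 flux start flux run -N4 -n8 t/mpi/hello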
