PMI error when running on SDSC Expanse #6924

Open
JiakunYan opened this issue Feb 26, 2024 · 4 comments

@JiakunYan

I am getting the following error when trying to run MPICH on SDSC Expanse (an InfiniBand machine with Slurm).

srun -n 2 hello_world
PMII_singinit: execv failed: No such file or directory
[unset]: This singleton init program attempted to access some feature
[unset]: for which process manager support was required, e.g. spawn or universe_size.
[unset]: But the necessary mpiexec is not in your path.
PMII_singinit: execv failed: No such file or directory
[unset]: This singleton init program attempted to access some feature
[unset]: for which process manager support was required, e.g. spawn or universe_size.
[unset]: But the necessary mpiexec is not in your path.
[unset]: write_line error; fd=-1 buf=:cmd=get kvsname=singinit_kvs_943014_0 key=PMI_mpi_memory_alloc_kinds
:
system msg for write_line failure : Bad file descriptor
[unset]: write_line error; fd=-1 buf=:cmd=get kvsname=singinit_kvs_943015_0 key=PMI_mpi_memory_alloc_kinds
:
system msg for write_line failure : Bad file descriptor
exp-9-17: 0 / 1 OK
exp-9-17: 0 / 1 OK
^Csrun: interrupt (one more within 1 sec to abort)
srun: StepId=28936391.1 tasks 0-1: running
^Csrun: sending Ctrl-C to StepId=28936391.1
srun: forcing job termination
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 28936391.1 ON exp-9-17 CANCELLED AT 2024-02-26T11:17:29 ***
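
For reference, hello_world here is just a minimal MPI test; it is presumably along the following lines (a sketch of such a program, not the exact source, and the "rank / size OK" output format is inferred from the log above):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    /* With a working PMI setup, "srun -n 2" should print "0 / 2 OK" and
     * "1 / 2 OK"; in the log above both ranks report "0 / 1", i.e. each
     * process initialized as a singleton. */
    printf("%s: %d / %d OK\n", host, rank, size);

    MPI_Finalize();
    return 0;
}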

mpichversion output

MPICH Version: 4.3.0a1
MPICH Release date: unreleased development copy
MPICH ABI: 0:0:0
MPICH Device: ch4:ucx
MPICH configure: --prefix=/home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/mpich-master-4uxeueqze7pn3732cbji36ckezyqld4o --disable-silent-rules --enable-shared --with-pm=no --enable-romio --without-ibverbs --enable-wrapper-rpath=yes --with-yaksa=/home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/yaksa-0.2-3r62jn5cdiiovsmntoqdrkzircgzvxqh --with-hwloc=/home/jackyan1/opt/hwloc/2.9.1 --with-slurm=yes --with-slurm-include=/cm/shared/apps/slurm/current/include --with-slurm-lib=/cm/shared/apps/slurm/current/lib --with-pmi=slurm --without-cuda --without-hip --with-device=ch4:ucx --with-ucx=/home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/ucx-1.14.0-znbygw2dkj6m2uvbebmufqmkggyapleb --enable-libxml2 --enable-thread-cs=per-vci --with-datatype-engine=auto
MPICH CC: /home/jackyan1/workspace/spack/lib/spack/env/gcc/gcc -O2
MPICH CXX: /home/jackyan1/workspace/spack/lib/spack/env/gcc/g++ -O2
MPICH F77: /home/jackyan1/workspace/spack/lib/spack/env/gcc/gfortran -O2
MPICH FC: /home/jackyan1/workspace/spack/lib/spack/env/gcc/gfortran -O2
MPICH features: threadcomm

Any idea why this could happen?

@raffenet
Contributor

Can you confirm your MPICH library and hello_world are linked with the Slurm PMI library? The output suggests each process thinks it is a singleton, so something is wrong in the discovery of other processes in the job.

@JiakunYan
Author

According to the ldd output, it does seem to be linked against the Slurm PMI library.

srun -n 1 ldd ~/workspace/hpx-lci_scripts/spack_env/expanse/hpx-lcw/.spack-env/view/bin/hello_world
linux-vdso.so.1 (0x0000155555551000)
liblcw.so => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/lcw-master-qdc2ohhyw7cfzumwivkojiilsto66qlh/lib64/liblcw.so (0x000015555511a000)
libstdc++.so.6 => /cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/gcc-10.2.0-npcyll4gxjhf4tejksmdzlsl3d3usqpd/lib64/libstdc++.so.6 (0x0000155554d47000)
libm.so.6 => /lib64/libm.so.6 (0x00001555549c5000)
libgcc_s.so.1 => /cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/gcc-10.2.0-npcyll4gxjhf4tejksmdzlsl3d3usqpd/lib64/libgcc_s.so.1 (0x00001555547ac000)
libc.so.6 => /lib64/libc.so.6 (0x00001555543e7000)
liblci.so => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/lci-master-ladq3cnji4lzds5pvnmz3pr5kpskvhvs/lib64/liblci.so (0x00001555541c1000)
liblct.so => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/lci-master-ladq3cnji4lzds5pvnmz3pr5kpskvhvs/lib64/liblct.so (0x0000155553f78000)
libibverbs.so.1 => /lib64/libibverbs.so.1 (0x0000155553d58000)
libmpicxx.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/mpich-master-4uxeueqze7pn3732cbji36ckezyqld4o/lib/libmpicxx.so.0 (0x0000155553b35000)
libmpi.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/mpich-master-4uxeueqze7pn3732cbji36ckezyqld4o/lib/libmpi.so.0 (0x00001555534a2000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000155553282000)
/lib64/ld-linux-x86-64.so.2 (0x0000155555325000)
liblci-ucx.so => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/lci-master-ladq3cnji4lzds5pvnmz3pr5kpskvhvs/lib64/liblci-ucx.so (0x0000155553011000)
libnl-route-3.so.200 => /lib64/libnl-route-3.so.200 (0x0000155552d7f000)
libnl-3.so.200 => /lib64/libnl-3.so.200 (0x0000155552b5c000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000155552958000)
libhwloc.so.15 => /home/jackyan1/opt/hwloc/2.9.1/lib/libhwloc.so.15 (0x00001555526f9000)
libpciaccess.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/libpciaccess-0.17-jqqzmoorywzwslxnvh3whvxmxgggxddg/lib/libpciaccess.so.0 (0x00001555524ef000)
libxml2.so.2 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/libxml2-2.10.3-riigwi634oahw6njkyhbrhqjx2hsbjyt/lib/libxml2.so.2 (0x0000155552184000)
libucp.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/ucx-1.14.0-znbygw2dkj6m2uvbebmufqmkggyapleb/lib/libucp.so.0 (0x0000155551eb6000)
libucs.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/ucx-1.14.0-znbygw2dkj6m2uvbebmufqmkggyapleb/lib/libucs.so.0 (0x0000155551c55000)
libyaksa.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/yaksa-0.2-3r62jn5cdiiovsmntoqdrkzircgzvxqh/lib/libyaksa.so.0 (0x000015554f989000)
libxpmem.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/xpmem-2.6.5-36-n47tincumvgfjwbnhddzsqskzs7nxohd/lib/libxpmem.so.0 (0x000015554f786000)
librt.so.1 => /lib64/librt.so.1 (0x000015554f57e000)
libpmi.so.0 => /cm/shared/apps/slurm/current/lib64/libpmi.so.0 (0x000015554f378000)
libz.so.1 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/zlib-1.2.13-xhijn7cz7apogelukw47ulnzhhardvos/lib/libz.so.1 (0x000015554f160000)
liblzma.so.5 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/xz-5.4.1-knnmdfklcssmtvciq4pupvfqsh2upbzy/lib/liblzma.so.5 (0x000015554ef33000)
libiconv.so.2 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/libiconv-1.17-fdzdmyikb3i5dtfkt26raiyq63tumvnq/lib/libiconv.so.2 (0x000015554ec26000)
libuct.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/ucx-1.14.0-znbygw2dkj6m2uvbebmufqmkggyapleb/lib/libuct.so.0 (0x000015554e9eb000)
libnuma.so.1 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/numactl-2.0.14-k3pqb32bk6b5sl2c7kvzd6errjicvsye/lib/libnuma.so.1 (0x000015554e7df000)
libucm.so.0 => /home/jackyan1/workspace/spack/opt/spack/linux-rocky8-zen2/gcc-10.2.0/ucx-1.14.0-znbygw2dkj6m2uvbebmufqmkggyapleb/lib/libucm.so.0 (0x000015554e5c4000)
libslurm_pmi.so => /cm/shared/apps/slurm/23.02.7/lib64/slurm/libslurm_pmi.so (0x000015554e1d2000)
libresolv.so.2 => /lib64/libresolv.so.2 (0x000015554dfba000)
libatomic.so.1 => /cm/shared/apps/spack/0.17.3/cpu/b/opt/spack/linux-rocky8-zen/gcc-8.5.0/gcc-10.2.0-npcyll4gxjhf4tejksmdzlsl3d3usqpd/lib64/libatomic.so.1 (0x000015554ddb2000)

I also tried the --mpi=pmi2 option of srun and got a different error:

srun -n 2 --mpi=pmi2 hello_world
[cli_0]: write_line: message string doesn't end in newline: :cmd=put kvsname=28943582.7 key=-allgather-shm-1-0-seg-1/2 value=20D0539CE2EDC88B822000637577CC2B3200F8D74F[5]4F030088C70230AC7B6CE151210600020A150136CF1917B7513832A8384F9241AF360113230082BB9321060002C6CA67C9CF1917B7513832A8384F9241AF360013230082C8B7211201028F9A9BAD3BDF1A8A98[2]F0[4]CF1917B75138539E3E4BD7E037370113230082C119210600020A160136CF1917B751383A0B345023ADAE360113230082A92342088F9A9BAD3BDF1A0A8D5377CC2B32[2]705077CCAB33004F2300883B808067[4]43088F9A9BAD3BDF1A0AA7D377CC2B32[2]705077CCAB33004F2300882D75180006[2]C0241148[10]FFFF0A1501367E3977CC2B338142364F8797713526DF230007AB73014A0C[2]278AB00FA1338142364F5C7C613526DF22[2]7AD477CC2B338142364F5C7C613526DF2200010031D15C7CE1338142364FF28969352603270003C27301B30F77CCAB338142364FF28969352603270083C8730125030094057E3977CC2B33EF483850A2E73B3532DF2300072B9701490C[2]278AB00FA133EF48385077CC2B3532DF22[2]7AD477CC2B33EF48385077CC2B3532DF2200010031D15C7CE133EF4838500DDA33353203270003419701B30F77CCAB33EF4838500DDA3335320327008347970126088F9A9BAD3BDF1A0A478BBD3706360024:
[cli_1]: write_line: message string doesn't end in newline: :cmd=put kvsname=28943582.7 key=-allgather-shm-1-1-seg-1/2 value=2056E44C658714485C2000637577CC2B3200F8D74F[5]4F0300884C479A1F197DC693210600020A150136CF1917B7513832A8384F9241AF360113230082E5BD21060002C6CA67C9CF1917B7513832A8384F9241AF36001323008293FD211201028F9A9BAD3BDF1A8A98[2]F0[4]CF1917B75138539E3E4BD7E037370113230082854F210600020A160136CF1917B751383A0B345023ADAE36011323008295F342088F9A9BAD3BDF1A0A8D5377CC2B32[2]705077CCAB33004F2300883C808067[4]43088F9A9BAD3BDF1A0AA7D377CC2B32[2]705077CCAB33004F2300882E75180006[2]C0241148[10]FFFF0A1501367E3977CC2B338142364F8797713526DF230007AC7301490C[2]278AB00FA1338142364F5C7C613526DF22[2]7AD477CC2B338142364F5C7C613526DF2200010031D15C7CE1338142364FF28969352603270003C17301B30F77CCAB338142364FF28969352603270083C7730125030094057E3977CC2B33EF483850A2E73B3532DF2300072A97014B0C[2]278AB00FA133EF48385077CC2B3532DF22[2]7AD477CC2B33EF48385077CC2B3532DF2200010031D15C7CE133EF4838500DDA33353203270003409701B30F77CCAB33EF4838500DDA3335320327008346970126088F9A9BAD3BDF1A0A478BBD3706360024:
^Csrun: interrupt (one more within 1 sec to abort)
srun: StepId=28943582.7 tasks 0-1: running
^Csrun: sending Ctrl-C to StepId=28943582.7
srun: forcing job termination
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

@hzhou
Contributor

hzhou commented Feb 27, 2024

srun --mpi=pmi2 is working, but it looks like the exchanged address string gets too long to fit within the PMI message limit. Not sure where the inconsistency comes from.
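
For illustration, the seg-1/2 suffix in the failing keys suggests the long address value is being split into chunks sized by the PMI maximum value length. A rough sketch of that idea follows; pmi_max_val_size, the key format, and kvs_put are placeholder names, not MPICH's actual internals:

#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch: split one long KVS value across several puts so
 * that each chunk stays under the PMI maximum value length. */
static void put_segmented(const char *base_key, const char *value,
                          size_t pmi_max_val_size,
                          void (*kvs_put)(const char *key, const char *val))
{
    size_t total = strlen(value);
    size_t nseg = (total + pmi_max_val_size - 1) / pmi_max_val_size;

    for (size_t i = 0; i < nseg; i++) {
        char key[128], seg[4096];
        size_t off = i * pmi_max_val_size;
        size_t len = total - off < pmi_max_val_size ? total - off
                                                    : pmi_max_val_size;

        assert(len < sizeof seg);
        /* keys like "-allgather-shm-1-0-seg-1/2" in the log above */
        snprintf(key, sizeof key, "%s-seg-%zu/%zu", base_key, i + 1, nseg);
        memcpy(seg, value + off, len);
        seg[len] = '\0';
        kvs_put(key, seg);  /* each put should now fit within the limit */
    }
}

If the advertised maximum overstates what the Slurm PMI2 plugin actually accepts, each chunk still overflows the wire message, which would be consistent with the write_line failures above and with the halving workaround described below.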

@JiakunYan
Author

JiakunYan commented May 3, 2024

For the error reported with srun --mpi=pmi2, manually modifying the MPICH source code to halve pmi_max_val_size fixed the issue. I would appreciate it if MPICH could provide an environment variable that lets users control this value (like the I_MPI_PMI_VALUE_LENGTH_MAX environment variable in Intel MPI).
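
The kind of control being requested could be a simple environment-variable override applied where the PMI value length is negotiated. A sketch only; MPICH_PMI_VALUE_LENGTH_MAX and the surrounding names are hypothetical, not an existing MPICH interface:

#include <stdlib.h>

/* Illustrative only: cap the PMI value length via an environment variable,
 * mirroring what I_MPI_PMI_VALUE_LENGTH_MAX does in Intel MPI. */
static int pmi_max_val_size = 1024;   /* whatever the PMI backend reports */

static void apply_pmi_value_length_override(void)
{
    const char *s = getenv("MPICH_PMI_VALUE_LENGTH_MAX");
    if (s != NULL) {
        int v = atoi(s);
        /* only shrink, never grow beyond what the process manager allows */
        if (v > 0 && v < pmi_max_val_size)
            pmi_max_val_size = v;
    }
}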
