srun with Intel MPI hangs on multi-node jobs #9612

Open
vyscenkoh opened this issue Nov 27, 2023 · 1 comment

Comments

@vyscenkoh

Describe the bug
This is the last line of output before the job hangs indefinitely:
[0] MPI_Startup(): libfabric provider: verbs;ofi_rxm
No error is reported, and no traffic flows between the nodes.

To Reproduce
I have a fresh environment with Intel MPI 2021.11, libfabric 1.18.1-ipmi, Slurm 21.08.8-2, and a RoCEv2 network.
Intel MPI and Open MPI launched with mpirun, single node and multi node: OK.
Open MPI launched with srun, single node and multi node: OK.
Intel MPI launched with srun, single node: OK.
Intel MPI launched with srun, multi node: not OK (hangs).

Environment:
Rocky Linux 8.6

@j-xiong
Contributor

j-xiong commented Mar 27, 2024

Sorry for the late response.

If you set FI_LOG_LEVEL=warn, I would expect to see some warning messages about the connection failure. There may be something wrong in the network setup that is preventing rdma-cm from working properly.
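
A rough sketch of how that could be tried is below. The node count, task count, and binary name are placeholders and not taken from the report. It assumes srun's default behaviour of passing the caller's environment on to the launched tasks.

```shell
# Placeholder values (assumptions, not from the original report):
NODES=2
TASKS_PER_NODE=1
APP=./my_mpi_app   # hypothetical MPI binary

# FI_LOG_LEVEL is read by libfabric; "warn" should surface connection warnings.
export FI_LOG_LEVEL=warn

# Print the launch command instead of running it, so the snippet is safe to paste.
# Remove the leading "echo" to actually submit the multi-node run.
echo srun --nodes="$NODES" --ntasks-per-node="$TASKS_PER_NODE" "$APP"
```

If the warnings do mention connection setup, that would point at the network configuration between the nodes rather than at srun itself.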
