I am trying to launch an MPMD job using a rankfile that uses relative indexing (+n0 for the first hostname, +n1 for the second, etc.) on a system that uses the SLURM job scheduler.
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
OpenMPI v4.1.4.
sbatch --version reports 23.02.7
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Not certain (the system administrators installed it), but it looks like it was compiled from source.
Please describe the system on which you are running
Operating system/version: RHEL8
Computer hardware: Each node has 2 x Intel Broadwell E5-2695 v4 (18 cores), 128 GB RAM
Network type: Intel OmniPath
Details of the problem
When I try to launch an MPMD program using mpiexec -rf rankfile.txt to specify the layout of the ranks, I get an error if:
the rankfile uses relative indexing for the hostnames (+n0, +n1 etc.)
Replacing the relative indexing with actual hostnames makes the problem go away
The second application is on a different host than the first
There are enough ranks
The attached example fills up 2 nodes: the first node with 36 ranks of the first application, the second node with 36 ranks of the second application. Putting 1 rank of application 1 on the first node and 1 rank of application 2 on the second node does not produce an error.
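For readers without the attachment, the layout described above can be sketched as a small generator. This is only an illustration, not the attached rankfile.txt: the function name `make_rankfile` is hypothetical, and pinning each rank to a single slot numbered 0 to 35 is an assumption about how the 36 cores per node are used.

```python
def make_rankfile(ranks_per_node: int = 36, num_nodes: int = 2) -> str:
    """Build rankfile text that fills each node in turn using relative hosts.

    Assumption: one rank per slot, with slots numbered from 0 on every node.
    """
    lines = []
    for node in range(num_nodes):
        for slot in range(ranks_per_node):
            rank = node * ranks_per_node + slot
            # "+n<node>" is the relative host index into the allocation.
            lines.append(f"rank {rank}=+n{node} slot={slot}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Ranks 0-35 land on +n0 (first application), 36-71 on +n1 (second).
    with open("rankfile.txt", "w") as handle:
        handle.write(make_rankfile())
```

With the defaults, the first 36 lines reference +n0 and the remaining 36 reference +n1, which is the case that triggers the error when the two halves belong to different applications.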
The error I get is:
--------------------------------------------------------------------------
Rankfile claimed host +n1 by index that is bigger than number of allocated hosts.
--------------------------------------------------------------------------
[ec534:05640] [[7352,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 271
[ec534:05640] [[7352,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 402
I did allocate 2 nodes, so +n1 should have been a valid hostname.
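Since replacing the relative indices with actual hostnames avoids the error, one possible stopgap is to expand the +nN entries before calling mpiexec. The sketch below assumes the hostnames are already available as an ordered list (for example, from `scontrol show hostnames "$SLURM_JOB_NODELIST"` inside the batch script); the function name `expand_relative_hosts` and the hostnames `nodeA` and `nodeB` are hypothetical placeholders.

```python
import re


def expand_relative_hosts(rankfile_text: str, hostnames: list[str]) -> str:
    """Replace every '=+nN' in the rankfile text with '=<hostnames[N]>'."""

    def substitute(match: re.Match) -> str:
        index = int(match.group(1))
        if index >= len(hostnames):
            # Mirrors the condition Open MPI complains about, but checked here.
            raise IndexError(
                f"+n{index} requested but only {len(hostnames)} hosts given"
            )
        return "=" + hostnames[index]

    return re.sub(r"=\s*\+n(\d+)", substitute, rankfile_text)


if __name__ == "__main__":
    relative = "rank 0=+n0 slot=0\nrank 1=+n1 slot=0\n"
    # Placeholder hostnames; in a real job these come from the allocation.
    print(expand_relative_hosts(relative, ["nodeA", "nodeB"]), end="")
```

This only works around the problem on the user side; it does not explain why +n1 is rejected when two applications are launched.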
A minimal reproducer is attached. To run it, do sbatch ./batch_reduced.sh.
As noted in batch_reduced.sh, running 72 ranks of a single application works fine, so this is somehow related to running 2 applications with 36 ranks each.
mpirun_error.tar.gz