Rankfile and MPMD Issue #12446

Open
JaredCrean2 opened this issue Mar 29, 2024 · 0 comments

Background information

I am trying to launch an MPMD job using a rankfile with relative host indexing (+n0 for the first hostname, +n1 for the second, etc.) on a system that uses the SLURM job scheduler.

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

Open MPI v4.1.4.
sbatch --version reports 23.02.7.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Not certain (the system administrators installed it), but it looks like it was compiled from source.

Please describe the system on which you are running

  • Operating system/version: RHEL8
  • Computer hardware: Each node has 2 x Intel Broadwell E5-2695 v4 (18 cores), 128 GB RAM
  • Network type: Intel OmniPath

Details of the problem

When I try to launch an MPMD program using mpiexec -rf rankfile.txt to specify the layout of the ranks, I get an error if all of the following hold:

  • the rankfile uses relative indexing for the hostnames (+n0, +n1 etc.)
    • Replacing the relative indexing with actual hostnames makes the problem go away
  • the second application is on a different host than the first
  • there are enough ranks
    • The attached example fills up 2 nodes: the first node with 36 ranks of the first application, the second node with 36 ranks of the second application. Putting 1 rank of application 1 on the first node and 1 rank of application 2 on the second node does not produce an error.
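
For illustration, a rankfile with the layout described above could be generated with a short script like the one below. This is only a sketch and not the attached reproducer: the function name `make_rankfile` is hypothetical, and the slot numbering (logical cores 0 to 35 on each node) is an assumption based on the 2 x 18 core nodes.

```python
def make_rankfile(ranks_per_node=36, num_nodes=2):
    """Return rankfile text placing ranks_per_node consecutive ranks on each host.

    Hosts are referenced by relative index (+n0, +n1, ...).
    Assumption: slot=<k> names logical core k on that host.
    """
    lines = []
    for node in range(num_nodes):
        for core in range(ranks_per_node):
            rank = node * ranks_per_node + core
            # "+n<node>" refers to the node-th host in the allocation
            lines.append(f"rank {rank}=+n{node} slot={core}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the rankfile; redirect to rankfile.txt to use it with mpiexec -rf
    print(make_rankfile(), end="")
```

With such a file, the MPMD launch would look something like mpiexec -rf rankfile.txt -np 36 ./app1 : -np 36 ./app2, where app1 and app2 are placeholder executable names; the exact command in batch_reduced.sh may differ.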

The error I get is:

--------------------------------------------------------------------------
Rankfile claimed host +n1 by index that is bigger than number of allocated hosts.
--------------------------------------------------------------------------
[ec534:05640] [[7352,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 271
[ec534:05640] [[7352,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 402

I did allocate 2 nodes, so +n1 should have been a valid hostname.
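
Since replacing the relative indices with actual hostnames avoids the error, one possible stopgap is to rewrite the rankfile inside the job before calling mpiexec. The sketch below is an assumption-laden illustration: `resolve_relative_hosts` is a hypothetical helper, the hostnames `nodeA` and `nodeB` are placeholders, and it assumes the relative index order matches the order of the host list passed in (for example, the output of scontrol show hostnames "$SLURM_JOB_NODELIST").

```python
import re


def resolve_relative_hosts(rankfile_text, hostnames):
    """Replace each '=+n<k>' host reference with '=<hostnames[k]>'.

    Assumption: hostnames is ordered the same way the relative
    indices are interpreted by the launcher.
    """
    def substitute(match):
        index = int(match.group(1))
        if index >= len(hostnames):
            raise ValueError(
                f"+n{index} exceeds the {len(hostnames)} allocated hosts"
            )
        return "=" + hostnames[index]

    # Only touch host fields of the form "=+n<k>"; slot specs are left as-is
    return re.sub(r"=\+n(\d+)\b", substitute, rankfile_text)


if __name__ == "__main__":
    example = "rank 0=+n0 slot=0\nrank 1=+n1 slot=0\n"
    # nodeA / nodeB are placeholder hostnames for illustration
    print(resolve_relative_hosts(example, ["nodeA", "nodeB"]), end="")
```

This only works around the symptom; it does not explain why +n1 is rejected when two applications are launched.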

A minimal reproducer is attached. To run it, do sbatch ./batch_reduced.sh.

As noted in batch_reduced.sh, running 72 ranks of a single application works fine, so this is somehow related to running 2 applications with 36 ranks each.
mpirun_error.tar.gz
