Problems executing mpirun in parallel, it hangs and sends the message: ORTE does not know how to route a message to the specified daemon located on the indicated node #12476

Open
jonny261 opened this issue Apr 18, 2024 · 0 comments

Comments

@jonny261

Please submit all the information below so that we can understand the working environment that is the context for your question.

Background information

What version of Open MPI are you using? 4.1.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

I installed it with apt-get install openmpi-bin and libopenmpi-dev.

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version: Ubuntu 22.04.3 LTS

Details of the problem

I have a master node (192.168.1.10) and 4 compute nodes (.20, .30, .40, .50). I configured passwordless SSH, and from the master I can log in to each node without a password. I installed pssh, and I can run commands in parallel on every node from the master. I set up NFS, created a shared directory, and mounted it on each node; that works. I installed Open MPI, but when I run `mpirun -hostfile hosts ./hello_world`, it hangs.

I have to press Ctrl + Z to stop it, and then it shows me this message:
^Z mpirun Forwarding signal 20 to job

ORTE does not know how to route a message to the specified daemon
located on the indicated node:

my node: master-H510M-H
target node: 192.168.1.20

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.

[master-H510M-H] 3 more processes have sent help message help-errmgr-base.txt / no-path
[master-H510M-H] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
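
The two MCA parameters named in the output above can be set through the environment before re-running the job. This is only a sketch of what the help text suggests, assuming Open MPI 4.1.x picks up `OMPI_MCA_`-prefixed variables in the shell that launches `mpirun`; whether `routed=direct` actually resolves the hang is not confirmed here.

```shell
# Workaround suggested by the help message: use direct routing between daemons.
export OMPI_MCA_routed=direct

# Show every help/error message instead of the aggregated
# "3 more processes have sent help message" summary.
export OMPI_MCA_orte_base_help_aggregate=0
```

With these exported, the same `mpirun -hostfile hosts ./hello_world` command can be run again. The equivalent command-line form would be `mpirun --mca routed direct --mca orte_base_help_aggregate 0 -hostfile hosts ./hello_world`.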

Could you help me solve this error and be able to execute in parallel?
