Unable to run openMPI from two machines #12493

Open
sacharin1993 opened this issue Apr 25, 2024 · 4 comments

Comments

@sacharin1993

Please submit all the information below so that we can understand the working environment that is the context for your question.

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

It should be the version that ships with the OpenFOAM installation (a v4 release), but I also built the latest release (v5) from your website.
So I probably have two versions installed, v4 and v5.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

I installed the latest OpenFOAM release, 2312, which comes with Open MPI version 4.
I also built the latest version 5 from your website.

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version: Ubuntu 22.04
  • Computer hardware: ASUS KRPA-U16 motherboard and AMD EPYC 7713P on both machines
  • Network type: I am using one Ethernet port to connect the two machines

Details of the problem

I have two machines, A and B, with identical hardware and software. SSH between them and the shared folder (hosted on A) both seem to work without problems.
On either machine I can run an example (such as hello_c.c from the examples folder) or an OpenFOAM simulation in parallel on all 64 cores of that machine with a command such as: ...$ mpirun -np 64 ./hello
If I try to use both machines, for example with ...$ mpirun --hostfile /etc/hosts -np 168 ./hello , the terminal hangs and shows no output, not even an error message.
I am attaching some of my system's configuration files and the final part of the strace output for the command ...$ strace mpirun --hostfile /etc/hosts -np 128 ./hello

Note: If you include verbatim output (or a code block), please use a GitHub Markdown code block like below:

shell$ mpirun -n 2 ./hello_world

documents.zip

@ggouaillardet
Contributor

/etc/hosts is not a valid Open MPI hostfile.
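For illustration: /etc/hosts maps IP addresses to host names, whereas an Open MPI hostfile lists one host per line, optionally followed by a `slots=N` count. A minimal sketch of such a file for this setup follows. The host names A and B are the placeholders used in this issue, the file name `machines` is arbitrary, and `slots=64` assumes one slot per core on each machine.

```shell
# Write a two-node hostfile. "A" and "B" stand for the real host names or IPs;
# slots=64 is an assumption based on the 64 cores per machine reported above.
cat > machines <<'EOF'
A slots=64
B slots=64
EOF

# Sum the slots declared in the file, i.e. the largest -np that fits
# without oversubscribing either node.
total=$(awk -F'slots=' 'NF > 1 { sum += $2 } END { print sum }' machines)
echo "total slots: $total"
```

With a file like this, the two-node run would be launched as `mpirun --hostfile machines -np 128 ./hello`.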

@sacharin1993
Author

sacharin1993 commented Apr 25, 2024

Thanks for the reply, but I am not sure I understand what you mean I should change.
In the past, using an etc.hosts file like the ones I attached in my previous message, I managed to run Open MPI correctly on two nodes.
Although I did not mention it, this time I have also tried a "machines" text file listing nodes A and B (either just these names, or the IPs and the names). If I list only A and run Open MPI on 64 cores with $ mpirun --hostfile machines -np 64 ./hello , it works. If I change the name to B, or list both A and B (changing -np to 128), and launch mpirun from A, it does not work. It just hangs, as in the other case (/etc/hosts).

@ggouaillardet
Contributor

Make sure there is no firewall between the two hosts (passwordless SSH is necessary but not sufficient). With Open MPI v4, try restricting communication to known interfaces, for example mpirun --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 ...
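As a rough sketch of how that suggestion might be applied: the interface name `eth0` is only an example and the real name on each machine has to be looked up first. The helper `mpirun_cmd` below is hypothetical and only prints the command line; the hostfile name `machines` and `-np 128` are taken from the earlier comments.

```shell
# To find the name of the interface that links A and B, run on both machines:
#   ip -br addr
# To check whether Ubuntu's ufw firewall is active, run on both machines:
#   sudo ufw status

# Hypothetical helper: print an mpirun command line that restricts both the
# TCP BTL (MPI traffic) and the OOB channel (runtime traffic) to one interface.
mpirun_cmd() {
    iface=$1
    hostfile=$2
    np=$3
    echo "mpirun --mca btl_tcp_if_include $iface --mca oob_tcp_if_include $iface --hostfile $hostfile -np $np ./hello"
}

# eth0 is a placeholder; substitute the name reported by "ip -br addr".
mpirun_cmd eth0 machines 128
```

Printing the command rather than running it makes it easy to review the flags before launching from A.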
