Open MPI fails with 480 processes on a single node #12489

Open
jstrodtb opened this issue Apr 23, 2024 · 4 comments
@jstrodtb

Background information

I am testing OpenFOAM on a Power 10 server node with 768 hardware threads. If I run with -np 768 (really, anything over about 256), Open MPI crashes because the operating system runs out of file handles. I have raised the file-handle limit to 64k and it still runs out. Another MPI code, LAMMPS, runs out at np = 240.

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

5.0.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

OS distribution package

Please describe the system on which you are running

  • Operating system/version: RHEL 9
  • Computer hardware: A single IBM Power 10 server node
  • Network type: None(?).

Details of the problem

I am running the OpenFOAM motorbike test with various mesh sizes. I expect to be able to run with MPI processes populating all the hardware threads, i.e. -np 768. However, the program crashes with an operating-system error reporting insufficient file handles. The same thing happens with other MPI codes once the process count is well over 200.

@devreal
Contributor

devreal commented Apr 24, 2024

Sounds like the file limits on that machine are too low. Try running `ulimit -n 2048` to increase that limit.

See https://stackoverflow.com/questions/34588/how-do-i-change-the-number-of-open-files-limit-in-linux for details.
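For reference, a quick way to inspect and raise the per-shell limit (the numbers here are only examples; making the hard limit persistent usually means editing `/etc/security/limits.conf` or the systemd defaults, which is outside the scope of this thread):

```shell
# Show the current soft and hard limits on open file descriptors
ulimit -Sn
ulimit -Hn

# Raise the soft limit for this shell session (cannot exceed the hard limit)
ulimit -n 65536
```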


github-actions bot commented May 8, 2024

It looks like this issue is expecting a response, but hasn't gotten one yet. If there are no responses in the next 2 weeks, we'll assume that the issue has been abandoned and will close it.

@github-actions github-actions bot added the Stale label May 8, 2024
@jstrodtb
Author

@devreal the upper limit on files is 65536. Upon further testing, the failure happens at around np = 250. 65536 = 256^2, so that tracks (obviously, the system has other file handles open).

Is it possible that Open MPI is creating a direct connection between each pair of processes living on the same node? That would explain this np^2 behavior.
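A back-of-the-envelope check of that hypothesis (rough arithmetic, not confirmed in this thread, and assuming the 65536 figure acts as a node-wide budget with a full TCP mesh holding one descriptor per socket on each end):

$$
\text{total fds} \approx N(N-1) \approx 256 \times 255 = 65{,}280 \approx 65{,}536
$$

which lands right around the np ≈ 250 where the failures appear.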

@ggouaillardet
Contributor

That can happen if communications use TCP, but that should not be the case by default.
Try

```shell
mpirun --mca pml ob1 --mca btl self,vader -np 768 ...
```

to force the shared memory component.

```shell
mpirun --mca pml_base_verbose 100 --mca btl_base_verbose 100 -np 768 ...
```

should tell you what is going on by default.
You will get more info if you configure Open MPI with `--enable-debug`.
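As a rough way to confirm whether file handles are really the resource being exhausted while one of those runs is in flight, the standard Linux `/proc` interfaces can be watched from a second terminal (nothing Open MPI specific here; `hello_world` is just a placeholder for whatever MPI binary is being launched):

```shell
# System-wide handle usage: first field = allocated handles, last field = fs.file-max ceiling
cat /proc/sys/fs/file-nr

# Per-process view for one rank of a placeholder ./hello_world binary:
# how many descriptors it currently holds, and its "open files" limit
ls /proc/$(pgrep -n hello_world)/fd | wc -l
grep "open files" /proc/$(pgrep -n hello_world)/limits
```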
