Parcelport fails to initialize when multiple jobs run on the same cluster #6097
Comments
Disabling the TCP parcelport should help. How did you disable it?
I've used -DHPX_WITH_PARCELPORT_MPI=ON -DHPX_WITH_PARCELPORT_TCP=OFF when building HPX.
Could you give us the error message you see in this case, please?
I used to get an error message along the lines of failure to initialise Parcelport before, but now it's crashing with the following:
I'm not sure what changed to make it crash instead, and I'm still trying to reproduce the previous behaviour. There is no difference between the code now run with the MPI parcelport and the initial code that used TCP.
Expected Behavior
Multiple independent jobs (e.g. a SLURM job array) should be able to run concurrently on the same cluster (on disjoint sets of nodes, not co-scheduled).
Actual Behavior
Only one (or none) of the jobs is able to run while all others crash at initialization with the following errors:
Steps to Reproduce the Problem
Schedule multiple jobs on a SLURM cluster with no dependencies between them, each using only a subset of the nodes (so that the SLURM scheduler can start multiple instances on separate partitions).
Using the MPI parcelport with TCP disabled did not help (the error changes, but initialization still fails).
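For anyone trying to reproduce this, the steps above could be sketched roughly as the job-array script below. This is only an illustration and not the script from the original report: the file name repro_job.sh, the application name ./hpx_app, the array range, and the node counts are all assumptions.

```shell
#!/bin/sh
# Write a hypothetical job-array script. The file name, resource sizes and
# application name (./hpx_app) are assumptions, not taken from the report.
cat > repro_job.sh <<'EOF'
#!/bin/sh
#SBATCH --job-name=hpx-repro
#SBATCH --array=0-3
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:10:00

# Each array element is an independent job on its own set of nodes.
srun ./hpx_app
EOF

# Submit on a cluster where SLURM is available:
# sbatch repro_job.sh
```

Each array element requests fewer nodes than the cluster has, so the scheduler is free to start several elements at the same time, which is the situation in which the failure was observed.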
Specifications