
job launch fails with srun: [cn05] [[31923,0],1] selected pml ob1, but peer [[31923,0],0] on cn04 selected pml #3339

Open
bhendersonPlano opened this issue Apr 19, 2024 · 10 comments

@bhendersonPlano

I've been working with the OpenMPI folks on an issue that might be related to pmix 5.0.2. The short version of the story is that if I back down to pmix 4.2.9, things work fine. If I set PMIX_MCA_gds=hash with pmix 5.0.2, things work fine as well.

The software stack is OpenMPI (5.0.3) with user-built hwloc (2.10.0), pmix (5.0.2), and slurm (23.11.6).

The system is a small number of dual-socket nodes running RHEL 9.3, each with an Intel E810-C card (only port zero has a connection); there are no other network cards in the system. The network connection is configured as an active-backup bond, which I know is odd, but that is how our imaging tool likes things. There is a single 100G switch and only one subnet. /home is shared across the nodes via NFS.

Bad behavior:

$ srun --mpi=pmix -N 2 -n 2 ./hello_mpi.503 
[cn04:545610] [[55359,0],1] selected pml ob1, but peer [[55359,0],0] on cn03 selected pml 

srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 138.0 ON cn03 CANCELLED AT 2024-04-18T13:05:02 ***
srun: error: cn04: task 1: Exited with exit code 14
srun: Terminating StepId=138.0
srun: error: cn03: task 0: Killed
$

Good behavior:

$ env PMIX_MCA_gds=hash srun --mpi=pmix -N 2 -n 2 ./hello_mpi.503 
Hello from rank 0 on cn03
Hello from rank 1 on cn04
$ 

Looking for some next steps on troubleshooting the issue. Thanks in advance for your help!
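
For reference, something along these lines reproduces the "Hello from rank N on host" output shown above (the actual hello_mpi.503 source isn't included here, so this is only a sketch):

/* Hypothetical reproducer along the lines of hello_mpi.503; the actual
 * source isn't shown in this issue. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);
    printf("Hello from rank %d on %s\n", rank, host);
    MPI_Finalize();
    return 0;
}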

@rhc54 (Contributor) commented Apr 19, 2024

@samuelkgutierrez This is the same problem we've seen with the mpi4py test suite in OMPI, which is a different environment (no Slurm, single node). I've been totally unable to reproduce it. It's possible that the root cause lies somewhere in OMPI, perhaps someone freeing a value too soon. The OPAL "modex" scripts are a little out of date, as they come from a time when OMPI required an abstraction layer over the PMI-related implementations, so it could be that there is something wrong in them.

Any thoughts on how we might chase this down? For now, I could just disable the shmem2 component on the release branches, though I would rather not, as we have no non-OMPI reports of a problem. It would be good if we could at least find a way to isolate the problem between PMIx and OMPI so we know which side holds the root cause.

@samuelkgutierrez (Member)

@rhc54, I think the prudent thing to do is to disable shmem2 outright. I fear that I won't have time in the near-term to diagnose this issue properly, unfortunately.

As far as ideas go, nothing substantive comes to mind at the moment. I wonder if this issue has something to do with the cross-version work we did recently? I'm thinking maybe something in those paths is subtly broken, but I couldn't say for sure. Based on some of the git-bisect reports that I've seen, that's my best guess. Sorry I can't be more helpful at the moment.

@rhc54 (Contributor) commented Apr 19, 2024

Understood. Given that we have no plans for an immediate release, I'll probably leave the component "active" for now so I can play with it if/when time permits. I don't trust git bisect to correctly identify things that might be the result of interactions between components, but I agree it should be easy to separate the cross-version code out.

Eventually, we are going to have to figure out a way to support the shmem2 component - I hate to drop it as it has a disproportionate impact on large systems such as those at the national labs, yet I can't consistently come out of retirement to deal with it (and I'm not pointing at you!). We can discuss at the next meeting.

@samuelkgutierrez (Member)

That all sounds reasonable to me, @rhc54. Thank you.

@rhc54 (Contributor) commented Apr 20, 2024

Okay, I have figured out the problem. It has nothing to do with the commit identified by git bisect - that was just a red herring.

The problem is that the index of the user-defined keys gets out of phase between the server and the clients - they differ by one in the OMPI case, which causes us to return the value from the first BTL component (on ob1 systems) instead of the PML base value. The drift gets worse when PRRTE is running as a DVM and multiple jobs are executed.

Ultimately, the problem lies in how we treat modex data. When a proc calls PMIx_Put, we place that data into its own internal hash table, which means the key gets assigned an index. We then push that data up to the local server, which places it in its own internal hash table and assigns its own index to it. When we execute the fence/modex, the server pulls the data from its hash by key (i.e., it packs the key-value pair and not the index) - but when the server subsequently puts the data into the shmem region, each key gets assigned yet another index...and that index doesn't get synced across the clients.

So the shmem index doesn't match the index assigned by the internal hash of the clients, and we return wrong values. This is why disabling shmem solves the problem - the data flows to the hash of the clients as key-value pairs, and the key-index conversion remains internally consistent on each client.
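
To make the drift concrete, here is a small standalone sketch (toy code, not PMIx internals; all names are made up) of two key-to-index dictionaries that register the same keys in a different order and therefore disagree about which index maps to which key:

/* Toy illustration of key-to-index drift between two dictionaries. */
#include <stdio.h>
#include <string.h>

#define MAX_KEYS 8

typedef struct {
    const char *keys[MAX_KEYS];
    int nkeys;
} dict_t;

/* Return the index of a key, registering it if not yet present. */
static int dict_index(dict_t *d, const char *key)
{
    for (int i = 0; i < d->nkeys; i++) {
        if (0 == strcmp(d->keys[i], key)) {
            return i;
        }
    }
    d->keys[d->nkeys] = key;
    return d->nkeys++;
}

int main(void)
{
    dict_t client = {0}, shmem = {0};

    /* The client's hash sees the PML key first. */
    dict_index(&client, "pml.base");   /* index 0 on the client */
    dict_index(&client, "btl.tcp");    /* index 1 on the client */

    /* The server stores the same keys into shmem in a different
     * order, so they get different indices there. */
    dict_index(&shmem, "btl.tcp");     /* index 0 in shmem */
    dict_index(&shmem, "pml.base");    /* index 1 in shmem */

    /* A lookup that trusts the client's index now lands on the wrong
     * key in shmem: the btl.tcp-instead-of-pml.base symptom above. */
    int idx = 0;  /* client's idea of where pml.base lives */
    printf("client index %d -> shmem key %s\n", idx, shmem.keys[idx]);
    return 0;
}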

I'm looking at a couple of solutions. The easiest is to just push all modex info into the hash component, but on large jobs the modex info can be sizable. Another option is to update the shared dictionary when storing modex info - more complicated, but I might be able to find a way to make it work. The only issue there is that a given client might store some other key solely in its own hash, and then we conflict because the index is already occupied. Have to think about that one.

@samuelkgutierrez (Member)

I'm glad you found the issue, @rhc54. Thanks for the explanation of the problem and possible solutions.

@bosilca (Contributor) commented Apr 20, 2024

Does this mean that if a single process introduces a key unknown to the others, all the processes will get invalid or maybe inconsistent modex values?

@rhc54 (Contributor) commented Apr 20, 2024

Not exactly. The problem isn't in the modex itself. It lies in the way we convert the string keys to integer indices when storing the values in the shmem region. At the moment, the key-to-index dictionaries in the server and the client get out of sync with each other, and that causes us to return the value for a different key than the one requested.

So if a process publishes a unique key unknown to the others, and that key appears before other common ones, then the indices will be off. If that key comes after all the other keys, then you'd never see the problem. It actually doesn't matter whether the value gets "put" and included in the modex operation - there are other pathways by which data can enter the shmem, and they would have similar issues.

This is why everything is fine when we disable shmem and just use the hash storage. In that case, we always pass the string key and value as a pair, and the client locally converts the key to an index that is unique to itself. There is no issue with indices getting out of sync between the server and client.

I'm working on the easy fix now - updating the dictionaries is more complicated and needs a lot more thought. The critical element for catching this is to always use the PMIx type checking to ensure the value returned is at least of the type that was expected. In the case cited above, that immediately caught the problem via the patch I provided in open-mpi/ompi#12475.
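
The kind of check meant here looks roughly like the following sketch against the standard PMIx client API (the actual patch lives in OMPI's modex macros; "my.example.key" is only a placeholder, not a real OMPI key):

/* Sketch: retrieve a key and verify the returned value has the
 * expected type before trusting its contents. */
#include <pmix.h>
#include <stdio.h>
#include <string.h>

static int get_string_key(const pmix_proc_t *peer, const char *key,
                          char **out)
{
    pmix_value_t *val = NULL;
    pmix_status_t rc = PMIx_Get(peer, key, NULL, 0, &val);
    if (PMIX_SUCCESS != rc) {
        return -1;                      /* key not found at all */
    }
    if (PMIX_STRING != val->type) {
        /* Wrong type back: this is the diagnostic that exposed the
         * off-by-one index instead of silently using bad data. */
        fprintf(stderr, "key %s returned type %d, expected PMIX_STRING\n",
                key, (int)val->type);
        PMIX_VALUE_RELEASE(val);
        return -1;
    }
    *out = strdup(val->data.string);
    PMIX_VALUE_RELEASE(val);
    return 0;
}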

@bosilca (Contributor) commented Apr 20, 2024

The type checking provided in the patch only works if the key it was confused with has a different type, so it is more of a band-aid than a real solution.

@rhc54 (Contributor) commented Apr 20, 2024

It isn't meant as a solution - it provides a diagnostic so you aren't chasing an empty string. In this case, it (a) told me what was wrong (a value of a different type was being returned), which then (b) allowed me to simply print out the keys and see that we were off by one, returning the btl.tcp value instead of pml.base.

Bottom line: we went from having no idea what was wrong to having useful diagnostic info that pointed straight at the problem.
