job launch fails with srun: [cn05] [[31923,0],1] selected pml ob1, but peer [[31923,0],0] on cn04 selected pml #3339
Comments
@samuelkgutierrez This is the same problem we've seen before. Any thoughts on how we might chase this down? For now, I could just disable the shmem2 component on the release branches, though I would rather not, as we have no non-OMPI reports of a problem. It would be good if we could at least find a way to isolate the problem between PMIx and OMPI so we know which side is the root cause.
@rhc54, I think the prudent thing to do is to disable the shmem2 component. As far as ideas go, nothing substantive comes to mind at the moment. I wonder if this issue has something to do with the cross-version work we did recently? I'm thinking maybe something in those paths is subtly broken, but I couldn't say for sure. Based on some of the git-bisect reports that I've seen, that's my best guess. Sorry I can't be more helpful at the moment.
Understood. Given that we have no plans for an immediate release, I'll probably leave the component "active" for now so I can play with it if/when time permits. I don't trust git bisect to correctly identify things that might be the result of interactions between components, but I agree it should be easy to separate the cross-version code out. Eventually, we are going to have to figure out a way to support the shmem2 component - I hate to drop it as it has a disproportionate impact on large systems such as those at the national labs, yet I can't consistently come out of retirement to deal with it (and I'm not pointing at you!). We can discuss at the next meeting.
That all sounds reasonable to me, @rhc54. Thank you.
Okay, I have figured out the problem. It has nothing to do with the commit identified by git bisect.

The problem is that the index of the user-defined keys gets out of phase between the server and the clients - they differ by one in the OMPI case, which causes us to return the value from the first BTL component (for ob1 systems) instead of the PML base value. The drift gets worse when PRRTE is running as a DVM and multiple jobs are executed.

Ultimately, the problem lies in how we treat modex data. The shmem index doesn't match the index assigned by the internal hash of the clients, and we return wrong values. This is why disabling shmem solves the problem - the data flows to the hash of the clients as key-value pairs, and the key-index conversion remains internally consistent on each client.

I'm looking at a couple of solutions. Easiest is just to push all modex info into the hash component, but on large jobs there could be sizable modex info. Another option is to update the shared dictionary when storing modex info - more complicated, but I might be able to find a way to make it work. Only issue there is that a given client might store some other key solely in its own hash, and then we conflict because the index is occupied. Have to think about that one.
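The index drift described above is easy to illustrate with a small sketch. This is plain Python, not PMIx code - the key names (`btl.tcp`, `pml.base`) are just the ones mentioned in this thread, and `index_of` is a hypothetical stand-in for the key-to-index dictionary on each side:

```python
def index_of(d, key):
    """Assign the next free index to a key on first sight (insertion order)."""
    if key not in d:
        d[key] = len(d)
    return d[key]

server, client = {}, {}

# The server registers a BTL key before the PML key...
index_of(server, "btl.tcp")   # -> index 0
index_of(server, "pml.base")  # -> index 1

# ...but the client never registered the BTL key locally.
index_of(client, "pml.base")  # -> index 0

# The same string key now maps to different indices on each side, so a
# lookup using the client's index hits the server's btl.tcp slot instead.
print(server["pml.base"], client["pml.base"])
```

Once the dictionaries diverge like this, every lookup by index on the client side is silently off by one, which matches the "btl.tcp instead of pml.base" symptom.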
I'm glad you found the issue, @rhc54. Thanks for the explanation of the problem and possible solutions.
Does this mean that if a single process introduces a key unknown to the others, all the processes will get invalid or maybe inconsistent modex values?
Not exactly. The problem isn't in the modex itself. It lies in the way we convert the string keys to integer indices when storing the values in the shmem region. At the moment, the key-to-index dictionaries in the server and client get out of sync with each other, and that causes us to return the value from a different key than the one requested.

So if a process publishes a unique key unknown to the others, and that key appears before other common ones, then the indices will be off. If that key comes after all the other keys, then you'd never see the problem. It actually doesn't matter if the value gets "put" and included in the modex operation - there are other pathways by which data can enter the shmem, and they would have similar issues.

Which is why everything is fine when we disable shmem and just use the hash storage. In that case, we always pass the string key and value pair, and the client locally converts the key to an index that is unique to itself. There is no issue with getting indices out of sync between the server and client.

I'm working on the easy fix now - updating dictionaries is more complicated and needs a lot more thought.

The critical element for catching this is to always use the PMIx type checking to ensure the value returned is at least the type that was expected. In the case cited above, that immediately caught the problem via the patch I provided in open-mpi/ompi#12475.
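The type-check diagnostic can be sketched in the same illustrative style (again plain Python, not the PMIx API - the stored types and the `get_typed` helper are hypothetical): when an index mixup returns the value stored under a neighboring key, checking the expected type fails loudly instead of silently handing back the wrong data.

```python
# Values stored by index on the "server" side as (type, payload) pairs.
shmem_values = [
    ("byte_object", b"\x00\x01"),  # btl.tcp at index 0
    ("string", "ob1"),             # pml.base at index 1
]

def get_typed(index, expected_type):
    """Return the payload at index, refusing a value of the wrong type."""
    actual_type, payload = shmem_values[index]
    if actual_type != expected_type:
        raise TypeError(f"expected {expected_type}, got {actual_type}")
    return payload

# A client whose dictionary is off by one asks for pml.base at index 0:
try:
    get_typed(0, "string")
    mismatch = None
except TypeError as e:
    mismatch = str(e)

print(mismatch)
```

As noted above, this doesn't fix anything by itself - it just converts "mysteriously wrong value" into an immediate, pointed error.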
The type checking provided in the patch only works if the key it was confused with has a different type, so it is more of a band-aid than a real solution.
It isn't meant as a solution - it provides a diagnostic so you aren't chasing an empty string. In this case, it (a) told me what was wrong (returning a different type), which then (b) allowed me to simply print out the keys to see that we were off by one and returning the btl.tcp value instead of the pml.base value. Bottom line: we went from having no idea what was wrong to having useful diagnostic info that pointed straight at the problem.
I've been working with the Open MPI folks on an issue which might be related to PMIx 5.0.2. Short version of the story: if I back down to PMIx 4.2.9, things work fine. If I set PMIX_MCA_gds=hash with PMIx 5.0.2, things work fine as well.
Software is Open MPI (5.0.3) with user-built hwloc (2.10.0), PMIx (5.0.2), and Slurm (23.11.6).
The system is a small number of dual-socket nodes running RHEL 9.3, each with an Intel E810-C card (only port zero has a connection); there are no other network cards in the system. The network connection is configured in an active-backup bond, which I know is odd, but that's how our imaging tool likes things. There is a single 100G switch and only one subnet. /home is shared across the nodes via NFS.
Bad behavior:
Good behavior:
Looking for some next steps on troubleshooting the issue. Thanks in advance for your help!
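For anyone hitting the same symptom, the workaround reported in this thread can be applied as an environment setting before launch (the launch command is just an example; the application name is a placeholder):

```shell
# Force the PMIx "hash" gds component so the shmem2 path is bypassed
# (workaround reported above for PMIx 5.0.2 with Open MPI 5.0.3).
export PMIX_MCA_gds=hash
# Then launch as usual, e.g.:
#   srun ./my_app      # my_app is a placeholder application name
```

Downgrading to PMIx 4.2.9, the other reported workaround, avoids the affected code path entirely.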