Update PRRTE git submodule to point to Open MPI fork #12449

Merged
merged 3 commits into main from pr/shift-prte-git-submodule on May 2, 2024

Conversation

jsquyres
Member

Shift away from the upstream PRRTE (https://github.com/openpmix/prrte) and point to an Open MPI-specific fork for future releases of Open MPI. The direct consequence of this is that the PRRTE bundled in Open MPI will now be an Open MPI-specific fork of PRRTE.

Tighter integration between Open MPI and this bundled, Open MPI-specific PRRTE will likely happen over time, but right now, it is functionally identical to the upstream PRRTE.

Note: this commit updates to the current upstream PRRTE master branch tip (5cde35d439) plus one additional Open MPI-specific commit (1d867e8498) that edits the PRRTE VERSION file to include "ompi" in the version string.


This commit effectively means that if we pull, cherry-pick, or otherwise obtain a future commit with git hash X from the upstream PRRTE git repo, it will have a different git hash Y in the Open MPI PRRTE repo. This may not be a big deal, but it is worth noting.


When checking out this branch or updating main after this PR merges, developers must re-sync their submodules:

git submodule sync
git submodule update --recursive --remote 3rd-party/prrte

Given that we have not updated the PRRTE submodule URL for the Open MPI v5.0.x branch, a submodule sync (and possibly a submodule update) will be necessary every time you switch between the Open MPI main and v5.0.x branches.
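
For example, a minimal sketch of the branch-switching workflow described above (branch names and ordering are illustrative; the submodule path is the one shown above):

# After this PR merges: switch to main and re-point the PRRTE submodule
git checkout main
git submodule sync 3rd-party/prrte
git submodule update --init --recursive 3rd-party/prrte

# Switching back to v5.0.x, which still uses the upstream PRRTE URL
git checkout v5.0.x
git submodule sync 3rd-party/prrte
git submodule update --init --recursive 3rd-party/prrte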

@hppritcha
Member

Oh, one of those destructor failures again. I thought I'd fixed that, but maybe there's another.

@hppritcha
Member

git bisect implicates openpmix/prrte@6e9417a
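
For reference, a rough sketch of how a bisect like this can be run inside the PRRTE submodule; the known-good SHA and the test script are placeholders, not values from this thread:

cd 3rd-party/prrte
git bisect start
git bisect bad HEAD
git bisect good <last-known-good-sha>           # placeholder
# At each step, rebuild Open MPI against the checked-out PRRTE and run the failing test
git bisect run <path-to-failing-test-script>    # hypothetical script
git bisect reset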

@nmww2aviation

@hppritcha Can you post the actual error here so I can look into it? I'm continuing to clean up group operations - I'll bet, though, that the problem here is that you didn't update the PMIx submodule to sync with the changes in PRRTE. The fix for local group ops required a coordinated change between the two packages.
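
One quick way to check whether the two submodules are in sync is to compare their recorded commits; the 3rd-party/openpmix path is an assumption, mirroring the 3rd-party/prrte path above:

git submodule status 3rd-party/openpmix 3rd-party/prrte
# Compare the reported SHAs against the PMIx and PRRTE commits that carry the coordinated group fix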

@rhc54
Contributor

rhc54 commented Apr 2, 2024

Sigh - sorry about that last note, I was logged into a different account!

@hppritcha
Member

Yep, it looks like advancing the openpmix SHA to 2f225b0748febe4df3695ec89d16a33ce155394e makes the problem go away, at least for -np 1.
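
For anyone reproducing this, a sketch of advancing the openpmix submodule to that SHA; the 3rd-party/openpmix path is assumed:

cd 3rd-party/openpmix
git fetch origin
git checkout 2f225b0748febe4df3695ec89d16a33ce155394e
cd ../..
# Rebuild and re-run the failing test; commit the submodule bump if it resolves the problem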

@rhc54
Contributor

rhc54 commented Apr 2, 2024

You folks will need to decide how you want to handle this for release branches. The group rework is not being brought across to the PMIx/PRRTE release branches, but that also means that some of the group-related problems won't be fixed there. We can discuss when things get to an appropriate point, if you like.

@hppritcha
Member

Too bad. Looks like we're on to the next failure.

@rhc54
Contributor

rhc54 commented Apr 2, 2024

Well, let me know if it is PMIx related - I'll try to help give you a solid launch point for your PRRTE fork.

@hppritcha
Member

Looks like something has changed with PMIx_Group_construct that causes MPI_Intercomm_create_from_groups to fail. Going back to what is currently the head of main on ompi (including the SHAs for prrte/openpmix), the two-process runs are successful.

@rhc54
Contributor

rhc54 commented Apr 2, 2024

Hmmm...that seems odd. Testing with current head of PMIx/PRRTE master branches shows that PMIx_Group_construct with the "assign_ctx_id" attribute works fine (this test includes grp info, but that's irrelevant):

$ prterun -n 2 ./group_lcl_cid
Client ns prterun-Ralphs-iMac-59184@1 rank 0 pid 59362: Running
Client ns prterun-Ralphs-iMac-59184@1 rank 1 pid 59363: Running
Rank 1 Group construct complete with status PMIX_SUCCESS KEY pmix.grp.id CID 4294967295
Rank 0 Group construct complete with status PMIX_SUCCESS KEY pmix.grp.id CID 4294967295
prterun-Ralphs-iMac-59184@1:1: PMIx_Get LOCAL CID for rank 0 SUCCESS value:  PMIX_VALUE:  Data type: PMIX_SIZE	Value: 1234
prterun-Ralphs-iMac-59184@1:0: PMIx_Get LOCAL CID for rank 0 SUCCESS value:  PMIX_VALUE:  Data type: PMIX_SIZE	Value: 1234
prterun-Ralphs-iMac-59184@1:0: PMIx_Get LOCAL CID for rank 1 SUCCESS value:  PMIX_VALUE:  Data type: PMIX_SIZE	Value: 1235
prterun-Ralphs-iMac-59184@1:1: PMIx_Get LOCAL CID for rank 1 SUCCESS value:  PMIX_VALUE:  Data type: PMIX_SIZE	Value: 1235
prterun-Ralphs-iMac-59184@1:0 COMPLETE
prterun-Ralphs-iMac-59184@1:1 COMPLETE

@rhc54
Contributor

rhc54 commented Apr 2, 2024

I also built OMPI main with the head of PMIx/PRRTE and ran a simple spawn test, which also uses the comm_cid code that calls PMIx_Group_construct, and it worked fine:

$ mpirun -n 2 ./simple_spawn
[prterun-Ralphs-iMac-80906@1:1 pid 80908] starting up on node Ralphs-iMac.local!
[prterun-Ralphs-iMac-80906@1:0 pid 80907] starting up on node Ralphs-iMac.local!
0 completed MPI_Init
1 completed MPI_Init
Parent [pid 80907] about to spawn!
Parent [pid 80908] about to spawn!
[prterun-Ralphs-iMac-80906@2:0 pid 80909] starting up on node Ralphs-iMac.local!
[prterun-Ralphs-iMac-80906@2:1 pid 80910] starting up on node Ralphs-iMac.local!
[prterun-Ralphs-iMac-80906@2:2 pid 80911] starting up on node Ralphs-iMac.local!
1 completed MPI_Init
Hello from the child 1 of 3 on host Ralphs-iMac.local pid 80910
2 completed MPI_Init
Hello from the child 2 of 3 on host Ralphs-iMac.local pid 80911
0 completed MPI_Init
Hello from the child 0 of 3 on host Ralphs-iMac.local pid 80909
Parent done with spawn
Parent done with spawn
Parent sending message to child
Child 0 received msg: 38
Child 1 disconnected
Child 2 disconnected
Parent disconnected
Parent disconnected
Child 0 disconnected
80911: exiting
80908: exiting
80907: exiting
80910: exiting

Perhaps you could share what you are doing, and what error you see? Is this a case again where we run a gazillion things and then eventually call something that accesses this code? If so, then maybe it isn't this specific code that is the source of the problem.

@rhc54
Contributor

rhc54 commented Apr 3, 2024

Not sure who (if anyone) is monitoring the upstream PRRTE repo, but I posted a fix this morning to a problem cited by AMD last night (openpmix/prrte@f01e2a2) - it affects their web-based map display.

We should probably chat sometime about whether/how you plan to keep an eye on the upstream repo, as there are changes that may impact you.

@rhc54
Contributor

rhc54 commented Apr 3, 2024

@hppritcha You might try adding this patch, simply to help diagnostics:

diff --git a/.github/workflows/ompi_mpi4py.yaml b/.github/workflows/ompi_mpi4py.yaml
index f23f20fb2f..bed157bc26 100644
--- a/.github/workflows/ompi_mpi4py.yaml
+++ b/.github/workflows/ompi_mpi4py.yaml
@@ -65,6 +65,9 @@ jobs:
         mca_params="$HOME/.prte/mca-params.conf"
         mkdir -p "$(dirname "$mca_params")"
         echo rmaps_default_mapping_policy = :oversubscribe >> "$mca_params"
+        mca_params="$HOME/.pmix/mca-params.conf"
+        mkdir -p "$(dirname "$mca_params")"
+        echo gds = hash >> "$mca_params"

     - name: Show MPI
       run:  ompi_info

Unfortunately, even if disabling PMIx shmem makes the problem go away, that doesn't mean the shmem component is the cause of the problem. The component would be sensitive to OMPI memory corruption, so it may just be getting stomped on. I'm not sure how it would tie into PMIx_Group_construct, though, so it may prove to be a red herring. Still, maybe worth a try?
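
For a local reproduction, the same diagnostic can be tried without the workflow patch by forcing the hash gds component through a PMIx MCA parameter file, or per run via an environment variable (the PMIX_MCA_ prefix spelling is assumed):

mkdir -p "$HOME/.pmix"
echo "gds = hash" >> "$HOME/.pmix/mca-params.conf"

# Or, per run (env-var spelling assumed):
PMIX_MCA_gds=hash mpirun -n 2 ./my_test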

@hppritcha
Member

Thanks Ralph. I'll take a look with the patch. Note, however, that the problem doesn't seem to have anything to do with spawn/connect-accept.

@jsquyres
Member Author

jsquyres commented Apr 4, 2024

Not sure who (if anyone) is monitoring the upstream PRRTE repo,

Yes, we need to figure out how we're going to manage the OMPI fork of PRRTE; part of the reason I created this PR was to force that conversation. We didn't have a Tuesday call this week to discuss it.

hppritcha added a commit to hppritcha/pmix that referenced this pull request Apr 4, 2024
PR 3329 introduced a regression which causes Open MPI's
MPI_Intercomm_from_groups method to fail.

Related to open-mpi/ompi#12449

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
rhc54 pushed a commit to openpmix/openpmix that referenced this pull request Apr 4, 2024
PR 3329 introduced a regression which causes Open MPI's
MPI_Intercomm_from_groups method to fail.

Related to open-mpi/ompi#12449

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
rhc54 pushed a commit to rhc54/openpmix that referenced this pull request Apr 4, 2024
PR 3329 introduced a regression which causes Open MPI's
MPI_Intercomm_from_groups method to fail.

Related to open-mpi/ompi#12449

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
(cherry picked from commit 213956c)
rhc54 pushed a commit to openpmix/openpmix that referenced this pull request Apr 4, 2024
PR 3329 introduced a regression which causes Open MPI's
MPI_Intercomm_from_groups method to fail.

Related to open-mpi/ompi#12449

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
(cherry picked from commit 213956c)
jsquyres and others added 2 commits April 29, 2024 10:07
Shift away from the upstream PRRTE (https://github.com/openpmix/prrte)
and point to an Open MPI-specific fork for future releases of Open
MPI.  The direct consequence of this is that the PRRTE bundled in Open
MPI will now be an Open MPI-specific fork of PRRTE.

Tighter integration between Open MPI and this bundled, Open
MPI-specific PRRTE will likely happen over time, but right now, it's
functionally identical to the upstream PRRTE.

Note: this commit both updates to the current upstream PRRTE master
branch tip (5cde35d439) plus one additional Open MPI-specific commit
(1d867e8498) that edits the PRRTE VERSION file to include "ompi" in
the version string.

Signed-off-by: Jeff Squyres <jeff@squyres.com>
2f225b0748febe4df3695ec89d16a33ce155394e

Fix local group operations and cleanup to pass all examples

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
@hppritcha
Member

@jsquyres I'd like to push some fixes to this PR to pass mpi4py.

@rhc54
Contributor

rhc54 commented Apr 29, 2024

@hppritcha Do you want to just resync the master branch with upstream? I believe all the fixes are in it.
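
A hedged sketch of what such a resync could look like for the Open MPI PRRTE fork; the fork URL, remote names, and branch are illustrative placeholders:

git clone <ompi-prrte-fork-url>
cd prrte
git remote add upstream https://github.com/openpmix/prrte.git
git fetch upstream
git checkout master
git merge upstream/master     # or rebase, per whatever policy the fork adopts
git push origin master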

also rebase to pull in some cid fixes for inter communicators

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
@hppritcha hppritcha self-requested a review April 29, 2024 21:49
@hppritcha
Member

@jsquyres I think this is ready when you want to merge. Probably want to ping on Slack/email about the

git submodule sync
git submodule update --recursive 

@jsquyres jsquyres merged commit 7495db7 into open-mpi:main May 2, 2024
13 checks passed
@jsquyres jsquyres deleted the pr/shift-prte-git-submodule branch May 2, 2024 15:22