mpi4py: Run mpi4py tests with a relocated Open MPI installation #12526
base: main
Conversation
Hello! The Git Commit Checker CI bot found a few problems with this PR:
1ca576e: mpi4py: Run mpi4py tests with a relocated Open MPI...
1163918: mpi4py: Support for workflow_dispatch trigger
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!
Force-pushed from 1ca576e to 007eb6a (Signed-off-by: Lisandro Dalcin <dalcinl@gmail.com>).
@wenduwan An obvious problem is that Spawn is still broken for other reasons. Even if you add the label to run the spawn tests, I believe these will fail before we reach the point where relocation is tested.
bot:aws:retest
repository: ${{ inputs.repository || 'mpi4py/mpi4py' }}
ref: ${{ inputs.ref }}
We already have a default value for inputs.repository, so should we drop || 'mpi4py/mpi4py'?
That default applies to the workflow_dispatch trigger, which runs the workflow via the GitHub Web UI. For the pull_request trigger, these input values are null and eventually evaluate to an empty string. Therefore, to the best of my knowledge, we need the || 'mpi4py/mpi4py'.
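For context, a minimal sketch of how the two triggers interact, assuming a workflow shaped like the one under review (the input names and the checkout step mirror the snippet above; everything else is illustrative):

```yaml
on:
  pull_request:
  workflow_dispatch:
    inputs:
      repository:
        description: mpi4py repository to test against
        default: mpi4py/mpi4py   # applies only to workflow_dispatch runs
      ref:
        description: git ref to check out

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          # On a pull_request event the inputs context is empty, so
          # inputs.repository evaluates to '' and the || fallback is
          # what actually supplies 'mpi4py/mpi4py'.
          repository: ${{ inputs.repository || 'mpi4py/mpi4py' }}
          ref: ${{ inputs.ref }}
```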
I'm unsure what error you are referring to here. I tested singleton spawn, and that worked fine. I then ran your loop-spawn program for 500 iterations, testing it with 1-3 processes and spawning 1-3 processes, and that worked fine. I did have to allow oversubscription when combining 3 procs with spawning 3 procs, or else it would run out of slots after 3 iterations, so it looks like there might be a race condition in recovering resources. Still, with that caveat, spawn is working. Is there something else that is wrong?
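For reference, a minimal mpi4py sketch of a loop-spawn test along the lines described above (the file names, iteration count, and process counts are assumptions, not the actual program from this thread):

```python
# parent.py: repeatedly spawn and disconnect a small child job.
# Run with e.g.: mpiexec -n 3 --oversubscribe python parent.py
from mpi4py import MPI

for _ in range(500):
    # Spawn is collective over COMM_WORLD; --oversubscribe may be
    # needed when parent and child jobs together exceed the slots.
    child = MPI.COMM_WORLD.Spawn("python", args=["child.py"], maxprocs=3)
    child.Disconnect()

# child.py (separate file):
#     from mpi4py import MPI
#     MPI.Comm.Get_parent().Disconnect()
```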
Force-pushed from 007eb6a to 37427d5 (Signed-off-by: Lisandro Dalcin <dalcinl@gmail.com>).
Indeed, the spawn tests, when run in isolation from the other tests, seem to be working fine.
Yes, look at the failing mpi4py/run_spawn/mpi4py-tests results [link]. After running all the spawn tests, it looks like ompi's internal state somehow ends up broken, and attempting to create an intercommunicator fails with MPI_ERR_INTERN.
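To illustrate that failure mode, a hypothetical minimal intercommunicator check in mpi4py (not one of the actual tests): it splits COMM_WORLD into two halves and bridges them, and in the broken state described above the Create_intercomm call would raise an exception carrying MPI_ERR_INTERN:

```python
# intercomm_check.py: run with at least two processes, e.g.
#     mpiexec -n 2 python intercomm_check.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
color = comm.Get_rank() % 2           # split even/odd ranks into two groups
local = comm.Split(color, key=comm.Get_rank())
remote_leader = 1 - color             # COMM_WORLD rank of the other group's leader
inter = local.Create_intercomm(0, comm, remote_leader, tag=0)
inter.Free()
local.Free()
```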
@rhc54 The following command line is the minimal set of tests to trigger the issue:
env MPI4PY_TEST_SPAWN=1 mpiexec -n 2 python test/main.py -v test_comm test_spawn test_ulfm
I'm seeing the following lines in the output:
Afraid not - been off doing other things, but am returning to these issues now. No timeline for completion.
Odd - I'm pretty sure we fixed that one. It took me some time and help from a user that could reproduce it to track it down, and the fix went into PMIx master two weeks ago. Checking the state of OMPI's PMIx submodule, it is (a) out-of-date by more than a month, and therefore missing that fix, and (b) in some odd detached state. It's almost like someone cherry-picked some change into it rather than updating the submodule pointer? Anyway, the problem is indeed fixed - just not over there for some reason.
I'll push a submodule update to this PR shortly.
@janjust do you have bandwidth to update this PR? I can do it otherwise.
Turns out I don't have permission to push to the PR. Opened #12532 and will rebase this one.
Howard leapfrogged me with #12565; we will likely merge that instead.
Addressing #12349 (comment).