
mpi4py: Run mpi4py tests with a relocated Open MPI installation #12526

Open · wants to merge 2 commits into base: main
Conversation

@dalcinl (Contributor) commented May 6, 2024

Addressing #12349 (comment).

github-actions bot commented May 6, 2024

Hello! The Git Commit Checker CI bot found a few problems with this PR:

1ca576e: mpi4py: Run mpi4py tests with a relocated Open MPI...

  • check_signed_off: does not contain a valid Signed-off-by line

1163918: mpi4py: Support for workflow_dispatch trigger

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!
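For reference, one common way to address this is to add the trailers to the existing commits and force-push. This is a sketch of the usual Git workflow, not instructions quoted from the bot, and the branch name is a placeholder:

# Add Signed-off-by trailers to the last two commits, then update the PR branch
git rebase --signoff HEAD~2
git push --force-with-lease origin <pr-branch>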

Signed-off-by: Lisandro Dalcin <dalcinl@gmail.com>
@dalcinl (Contributor, Author) commented May 6, 2024

@wenduwan An obvious problem is that Spawn is still broken for other reasons. Even if you add the label to run the spawn tests, I believe these will fail before we reach the point where relocation is tested.

@wenduwan (Contributor) commented May 6, 2024

bot:aws:retest

Comment on lines +105 to +106
repository: ${{ inputs.repository || 'mpi4py/mpi4py' }}
ref: ${{ inputs.ref }}
Contributor commented:

We already have a default value for inputs.repository, so should we drop || 'mpi4py/mpi4py'?

@dalcinl (Contributor, Author) replied:

That default applies only to the workflow_dispatch trigger, i.e. when the workflow is run via the GitHub Web UI. For the pull_request trigger, these input values are null and eventually evaluate to an empty string. Therefore, to the best of my knowledge, we do need the || 'mpi4py/mpi4py'.
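For illustration, here is a minimal sketch of the two trigger shapes involved; the structure and job name are assumptions for the example, not the actual contents of ompi_mpi4py_tests.yaml:

on:
  pull_request:            # no inputs defined here, so inputs.repository is empty
  workflow_dispatch:
    inputs:
      repository:
        default: 'mpi4py/mpi4py'
      ref:
        required: false

jobs:
  mpi4py:                  # hypothetical job name
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          # on pull_request the inputs context is empty, so the '||' fallback
          # restores the intended default repository
          repository: ${{ inputs.repository || 'mpi4py/mpi4py' }}
          ref: ${{ inputs.ref }}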

.github/workflows/ompi_mpi4py_tests.yaml (outdated review thread, resolved)
@rhc54 (Contributor) commented May 6, 2024

@wenduwan An obvious problem is that Spawn is still broken for other reasons. Even if you add the label to run the spawn tests, I believe these will fail before we reach the point where relocation is tested.

I'm unsure what error you are referring to here. I tested singleton spawn, and that worked fine. I then ran your loop spawn program for 500 iterations, testing it with 1-3 processes and spawning 1-3 processes, and that worked fine as well. I did have to allow oversubscription when combining 3 procs with spawning 3 procs, or else it would run out of slots after 3 iterations, so it looks like there might be a race condition in recovering resources. Still, with that caveat, spawn is working.

Is there something else that is wrong?

wenduwan added the mpi4py-all label (Run the optional mpi4py CI tests) on May 6, 2024
Signed-off-by: Lisandro Dalcin <dalcinl@gmail.com>
@dalcinl (Contributor, Author) commented May 7, 2024

I tested singleton spawn,

Indeed, the spawn tests, when run in isolation from the other tests, seem to be working fine.
Did you finally merge those changes about not using accept/connect and moving to something more robust?

Is there something else that is wrong?

Yes, look at the failing mpi4py/run_spawn/mpi4py-tests results [link]. After running all the spawn tests, it looks like Open MPI's internal state somehow ends up broken, and attempting to create an intercommunicator fails with MPI_ERR_INTERN.

@dalcinl (Contributor, Author) commented May 7, 2024

@rhc54 The following command line is the minimal set of tests to trigger the issue:

env MPI4PY_TEST_SPAWN=1 mpiexec -n 2 python test/main.py -v test_comm test_spawn test_ulfm

I'm seeing the following lines in the output:

[kw61149:364081] [[389,1],1] selected pml ob1, but peer [[389,1],0] on kw61149 selected pml ��
[kw61149:364081] OPAL ERROR: Unreachable in file ../../../ompi-main/ompi/communicator/comm.c at line 2372
[kw61149:364081] 0: Error in ompi_comm_get_rprocs
setUpClass (test_ulfm.TestULFMInter) ... ERROR
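For context, here is a minimal mpi4py sketch of the two operations that combination exercises: a collective Spawn followed by intercommunicator creation, which is roughly what test_ulfm.TestULFMInter.setUpClass appears to do and where the MPI_ERR_INTERN shows up. This is an illustration, not code from the mpi4py test suite, and the file name is a stand-in; run it with two processes, e.g. mpiexec -n 2 python repro_sketch.py.

# repro_sketch.py -- illustrative only
import sys
from mpi4py import MPI

world = MPI.COMM_WORLD
rank = world.Get_rank()

# Step 1: spawn a short-lived child job (collective over COMM_WORLD);
# the child simply disconnects from its parent and exits.
child_cmd = "from mpi4py import MPI; MPI.Comm.Get_parent().Disconnect()"
children = world.Spawn(sys.executable, args=["-c", child_cmd], maxprocs=1)
children.Disconnect()

# Step 2: build an intercommunicator between the two halves of COMM_WORLD.
# With the spawn tests run beforehand, this is the kind of call that fails
# with MPI_ERR_INTERN / ompi_comm_get_rprocs in the logs above.
color = rank % 2
local = world.Split(color, key=rank)
remote_leader = 1 - color   # world rank of the other group's leader (2 procs)
inter = local.Create_intercomm(0, world, remote_leader, tag=0)
inter.Free()
local.Free()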

@rhc54 (Contributor) commented May 7, 2024

Did you finally merge these changes related to not using accept/connect and move to something more robust?

Afraid not; I've been off doing other things, but I am returning to these issues now. No timeline for completion.

[kw61149:364081] [[389,1],1] selected pml ob1, but peer [[389,1],0] on kw61149 selected pml ��

Odd; I'm pretty sure we fixed that one. It took me some time, and help from a user who could reproduce it, to track it down, and the fix went into PMIx master two weeks ago.

Checking the state of OMPI's PMIx submodule, it is (a) out-of-date by more than a month, and therefore missing that fix, and (b) in some odd detached state. It's almost like someone cherry-picked some change into it rather than updating the submodule pointer?

Anyway, the problem is indeed fixed - just not over there for some reason.

@janjust (Contributor) commented May 7, 2024

I'll push a submodule update to this PR shortly.

@wenduwan (Contributor) commented May 8, 2024

@janjust do you have bandwidth to update this PR? I can do it otherwise.

@wenduwan (Contributor) commented May 8, 2024

Turns out I don't have permission to push to the PR. Opened #12532 and will rebase this one.

@wenduwan (Contributor) commented

Howard leapfrogged me with #12565; we will likely merge that instead.

Labels: mpi4py-all (Run the optional mpi4py CI tests), Target: main
Participants: 4