
TEST MPI4PY #12460

Closed
wants to merge 2 commits into from

Conversation

@wenduwan (Contributor) commented Apr 9, 2024

bot:notacherrypick

Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
@wenduwan wenduwan added mpi4py-all Run the optional mpi4py CI tests ⚠️ WIP-DNM! labels Apr 9, 2024
@github-actions github-actions bot added this to the v5.0.4 milestone Apr 9, 2024
@wenduwan (Contributor, Author) commented Apr 9, 2024

Saw this in the log:

Your application has invoked an MPI function that is not supported in
this environment.

  MPI function: MPI_Comm_create_from_group/MPI_Intercomm_create_from_groups
  Reason:       PMIx server unreachable
--------------------------------------------------------------------------

@hppritcha Is this the same issue that you successfully tracked down?
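
For reference, a minimal sketch of the call path that message points at, assuming the log really does come from a singleton launch (plain `python`, no mpirun/PMIx server) and an mpi4py build that exposes the MPI-4 Create_from_group binding; the script name is made up and this is not taken from the mpi4py test suite.

# sketch_create_from_group.py (hypothetical name) -- run directly as
# `python sketch_create_from_group.py` so Open MPI initializes as a singleton.
from mpi4py import MPI

group = MPI.COMM_SELF.Get_group()
# With no reachable PMIx server, Open MPI reports the
# "MPI function ... not supported in this environment" error on this call.
comm = MPI.Intracomm.Create_from_group(group)
comm.Free()
group.Free()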

@hppritcha (Member)

This looks like the singleton spawn problem we saw with ompi main a couple of months ago. I will take a look.

@wenduwan (Contributor, Author)

@hppritcha Thanks for checking. During yesterday's call @naughtont3 volunteered to investigate the issue. I was poking you to see if this rings a bell.

@rhc54 (Contributor) commented Apr 10, 2024

Just a suggestion: I doubt the fix @hppritcha mentions is in your v5.0 branch yet, unless you have updated the PMIx and PRRTE submodules in the last couple of weeks.

@wenduwan (Contributor, Author)

Thanks. I updated openpmix to include Howard's fix. Running the tests again.

@wenduwan (Contributor, Author)

Howard's fix does quell the create-intercomm-from-groups error, but the spawn failure remains:

 test_apply (test_util_pool.TestProcessPool.test_apply) ... Exception in thread Thread-6 (_manager_spawn):
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/mpi4py/futures/_core.py", line 349, in _manager_spawn
    comm = serialized(client_spawn)(pyexe, pyargs, nprocs, info)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/hostedtoolcache/Python/3.12.2/x64/lib/python3.12/site-packages/mpi4py/futures/_core.py", line 1056, in client_spawn
    comm = MPI.COMM_SELF.Spawn(python_exe, args, max_workers, info)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/mpi4py/MPI/Comm.pyx", line 2545, in mpi4py.MPI.Intracomm.Spawn
mpi4py.MPI.Exception: MPI_ERR_UNKNOWN: unknown error
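
For anyone trying to reproduce outside the CI, here is a hedged stand-alone sketch of the same singleton-spawn path (not the actual mpi4py.futures code; the file name and worker count are arbitrary): run it directly as `python spawn_repro.py` so MPI initializes as a singleton, then spawn workers via MPI.COMM_SELF.Spawn just as mpi4py.futures does.

# spawn_repro.py (hypothetical name) -- singleton spawn sketch.
import sys
from mpi4py import MPI

NWORKERS = 2  # arbitrary choice for the sketch

parent = MPI.Comm.Get_parent()
if parent != MPI.COMM_NULL:
    # Spawned child: report its world rank back to the parent and exit.
    parent.send(MPI.COMM_WORLD.Get_rank(), dest=0, tag=0)
    parent.Disconnect()
    sys.exit(0)

# Parent (singleton): this Spawn call is where MPI_ERR_UNKNOWN surfaces
# on the affected PMIx/PRRTE combinations.
inter = MPI.COMM_SELF.Spawn(sys.executable, args=[__file__], maxprocs=NWORKERS)
print([inter.recv(source=i, tag=0) for i in range(NWORKERS)])
inter.Disconnect()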

@rhc54 (Contributor) commented Apr 10, 2024

Try pulling PRRTE as well - any singleton comm_spawn issues would be there.

@wenduwan (Contributor, Author)

Bumped both PMIx and PRRTE to the head of their master branches.

@wenduwan (Contributor, Author)

Back to square one:

[fv-az525-673:02388] [[30249,1],2] selected pml ob1, but peer [[30249,1],0] on fv-az525-673 selected pml �
[fv-az525-673:02388] OPAL ERROR: Unreachable in file communicator/comm.c at line 2385
[fv-az525-673:02388] 1: Error in ompi_get_rprocs
[fv-az525-673:02387] [[30249,1],1] selected pml ob1, but peer [[30249,1],0] on fv-az525-673 selected pml �
[fv-az525-673:02387] OPAL ERROR: Unreachable in file communicator/comm.c at line 2385
[fv-az525-673:02387] 0: Error in ompi_get_rprocs
setUpClass (test_ulfm.TestULFMInter) ... ERROR
======================================================================
ERROR: setUpClass (test_ulfm.TestULFMInter)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/ompi/ompi/test/test_ulfm.py", line 196, in setUpClass
    INTERCOMM = MPI.Intracomm.Create_intercomm(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/mpi4py/MPI/Comm.pyx", line 2335, in mpi4py.MPI.Intracomm.Create_intercomm
mpi4py.MPI.Exception: MPI_ERR_INTERN: internal error
----------------------------------------------------------------------
Ran 1666 tests in 32.766s
FAILED (errors=1, skipped=83)
setUpClass (test_ulfm.TestULFMInter) ... ERROR
======================================================================
ERROR: setUpClass (test_ulfm.TestULFMInter)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/ompi/ompi/test/test_ulfm.py", line 196, in setUpClass
    INTERCOMM = MPI.Intracomm.Create_intercomm(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/mpi4py/MPI/Comm.pyx", line 2335, in mpi4py.MPI.Intracomm.Create_intercomm
mpi4py.MPI.Exception: MPI_ERR_INTERN: internal error
----------------------------------------------------------------------
Ran 1666 tests in 32.814s
FAILED (errors=1, skipped=83)
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
  Proc: [[30249,1],1]
  Errorcode: 1
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
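
The failing call here is MPI.Intracomm.Create_intercomm. Below is a hedged sketch of that kind of construction (not the actual test_ulfm.py code; the file name is made up): split COMM_WORLD into two halves and bridge them with an intercommunicator, run with e.g. `mpirun -np 2 python intercomm_repro.py`.

# intercomm_repro.py (hypothetical name) -- Create_intercomm sketch, needs >= 2 ranks.
from mpi4py import MPI

world = MPI.COMM_WORLD
rank = world.Get_rank()
color = rank % 2                  # two halves: even ranks and odd ranks
local = world.Split(color, rank)  # intracomm for this process's half

# Local leader is rank 0 of each half; the remote leader is the lowest
# world rank belonging to the other half (world rank 0 or 1).
remote_leader = 1 - color
inter = local.Create_intercomm(0, world, remote_leader, 0)
print(f"world rank {rank}: remote group size = {inter.Get_remote_size()}")
inter.Free()
local.Free()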

@rhc54 (Contributor) commented Apr 10, 2024

Not quite that bad - we know the ULFM tests won't pass (documented elsewhere and being worked on by Lisandro), so they should probably be turned "off" here.

@wenduwan (Contributor, Author)

I want to confirm whether this was a regression, so I reverted to openpmix v4.2.8 + prrte v3.0.3.

@rhc54 (Contributor) commented Apr 10, 2024

One thing you can do to more directly check is just see how far down the tests you get. If the ulfm tests come after the singleton comm_spawn, then you know that you got further than before. However, if everything passes with the reverted submodule pointers, then why not just stay there for this release branch?

@wenduwan (Contributor, Author)

So the tests passed with the previous openpmix + prrte combo, which confirms that this is a regression. This is not news.

Now that we have already released with the new submodule pointers I don't think we can simply revert. We have to fix the problem and move forward.

@hppritcha (Member)

Checking this out.

@hppritcha (Member)

Hmm... I can't reproduce on my usual aarch64 cluster, and yes, I did use the SHAs from 94e834e.

@bosilca (Member) commented Apr 11, 2024

The ULFM tests should pass. They passed before, so if the intercomm and the singleton issues are fixed, then all ULFM tests shall pass.

@wenduwan (Contributor, Author)

@hppritcha in my branch I reverted prrte/pmix to 3.0.3/4.2.8

You should be able to reproduce with master branches of both.

@hppritcha (Member)

@wenduwan at which np count do you first observe the failures?

@rhc54 (Contributor) commented Apr 11, 2024

Please see #12464 for a pass/fail matrix of PMIx/PRRTE combinations.

@rhc54 (Contributor) commented Apr 11, 2024

I will address the PMIx v5.0.2 problem - @hppritcha I could definitely use some help from @samuelkgutierrez, if available.

bot:notacherrypick

Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
@bosilca (Member) commented Apr 11, 2024

@rhc54 How do we use the matrix in #12464? What tests does it trace? All mpi4py tests?

@wenduwan (Contributor, Author)

at which np count do you first observe the failures?

@hppritcha I pointed back to pmix 5.0.2 + prrte 3.0.4 and it is failing at np=2

@wenduwan (Contributor, Author)

In case anyone wonders, I previously did a bisect and found that the failure starts with this commit: openpmix/openpmix@6163f21

@rhc54 (Contributor) commented Apr 11, 2024

Yeah, I don't necessarily believe that bisect result. We'll dig into it.

@rhc54 (Contributor) commented Apr 11, 2024

@rhc54 How do we use the matrix in #12464? What tests does it trace? All mpi4py tests?

Yes (according to the labeling), it is running all the mpi4py tests. The intent was to (a) identify if there are issues relating to both PMIx and PRRTE, or just one of them, or...? and (b) provide a starting place to investigate one or the other without having to simultaneously deal with multiple issues. I didn't bother recording the specifics of the failures - I figure that can be investigated while looking at the code itself.

@rhc54 (Contributor) commented Apr 11, 2024

at which np count do you first observe the failures?

@hppritcha I pointed back to pmix 5.0.2 + prrte 3.0.4 and it is failing at np=2

Per the matrix, the key is PMIx v5.0.2 - all versions of PRRTE beyond v3.0.3 fail these tests with that version of PMIx.

@wenduwan wenduwan closed this Apr 15, 2024
@wenduwan wenduwan deleted the v5.0.x_mpi4py branch April 15, 2024 19:04