mca_pml_ob1_recv_frag_callback_match occasional segfault #12495

Closed
bhendersonPlano opened this issue Apr 25, 2024 · 9 comments

@bhendersonPlano

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

5.0.3

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

self compiled with hwloc (2.10.0), pmix (5.0.2), and slurm (23.11.05)

Please describe the system on which you are running

  • Operating system/version: RHEL 9.3
  • Computer hardware: AMD dual socket node
  • Network type: single Intel E810-C card with one port active/node (100Gbit)

Details of the problem

We are running 8-node jobs with 8 ranks per node and seeing an occasional segmentation fault during MPI_Init. When it happens, it affects some number of ranks on a single host: sometimes just one rank aborts, but we have seen as many as 6, all on the same node. We are launching with srun and setting the environment variable PMIX_MCA_gds=hash as a workaround for another issue.

The stack trace shows:

[cn04:1194785] *** Process received signal ***
[cn04:1194785] Signal: Segmentation fault (11)
[cn04:1194785] Signal code: Address not mapped (1)
[cn04:1194785] Failing at address: 0xe0
[cn04:1194785] [ 0] /lib64/libc.so.6(+0x54db0)[0x7f54e6254db0]
[cn04:1194785] [ 1] /share/openmpi/5.0.3/lib/libmpi.so.40(mca_pml_ob1_recv_frag_callback_match+0x7d)[0x7f54e67eab3d]
[cn04:1194785] [ 2] /share/openmpi/5.0.3/lib/libopen-pal.so.80(+0xa7d8c)[0x7f54e6566d8c]
[cn04:1194785] [ 3] /lib64/libevent_core-2.1.so.7(+0x21b88)[0x7f54e649cb88]
[cn04:1194785] [ 4] /lib64/libevent_core-2.1.so.7(event_base_loop+0x577)[0x7f54e649e7a7]
[cn04:1194785] [ 5] /share/openmpi/5.0.3/lib/libopen-pal.so.80(+0x222af)[0x7f54e64e12af]
[cn04:1194785] [ 6] /share/openmpi/5.0.3/lib/libopen-pal.so.80(opal_progress+0x85)[0x7f54e64e1365]
[cn04:1194785] [ 7] /share/openmpi/5.0.3/lib/libmpi.so.40(ompi_mpi_init+0x46d)[0x7f54e663ce7d]
[cn04:1194785] [ 8] /share/openmpi/5.0.3/lib/libmpi.so.40(MPI_Init+0x5e)[0x7f54e66711ae]
[cn04:1194785] [ 9] /home/brent/bin/ior-3.0.1/ior[0x403780]
[cn04:1194785] [10] /lib64/libc.so.6(+0x3feb0)[0x7f54e623feb0]
[cn04:1194785] [11] /lib64/libc.so.6(__libc_start_main+0x80)[0x7f54e623ff60]
[cn04:1194785] [12] /home/brent/bin/ior-3.0.1/ior[0x4069d5]
[cn04:1194785] *** End of error message ***

The system core file size limit is set to unlimited, but I didn't find any core files lying around.

I tried some experiments this afternoon and ran 1000 back-to-back hello_mpi jobs launched with srun - 6 of them hit this issue. I then ran over 3000 salloc + mpirun hello_mpi jobs and didn't see the issue.
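(For reference, hello_mpi is just a minimal MPI test program. Its exact source isn't included in this issue, so the following is only a sketch of that kind of test, with hypothetical names; it exercises the same MPI_Init path where the segfault is reported.)

/* Hypothetical minimal hello_mpi-style reproducer sketch; the actual
 * test source is not part of this issue. The reported segfault occurs
 * inside the MPI_Init() call itself. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank = 0, size = 0;

    MPI_Init(&argc, &argv);               /* crash reported during this call */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* total ranks in the job */

    printf("hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}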

Any thoughts on next steps for debugging this issue? Maybe I should consider dropping back to PMIx 4.2.9 and see how that goes?

@bhendersonPlano
Author

I rebuilt Slurm and Open MPI to use PMIx 4.2.9 and dropped the PMIX_MCA_gds=hash setting. I ran ~3000 hello_world jobs in this environment without seeing any core dumps.

@nmww2aviation

Just to be clear: you originally said you ran 3000 jobs with mpirun using PMIx 5.0.2 and saw no problems. So I'm assuming your last test refers to executing with srun and not mpirun - yes??

I fail to see a connection between PMIx and ob1/recv being caught in a segfault - we don't have anything to do with the MPI message exchange. Likewise, it's hard to see what srun has to do with it, so I have no idea what to suggest. Given everything you have encountered across the two issue reports, I suspect there is something more fundamentally borked in this system.

@rhc54
Contributor

rhc54 commented Apr 25, 2024

My apologies - blasted GitHub had me logged into a different account when I wrote the above note. Sigh.

@bhendersonPlano
Author

No worries - thanks for taking a look at this for me.

Yep, the new testing with an srun launch against a PMIx 4.2.9-based Slurm/Open MPI build did not produce any core dumps in ~3000 runs. I'll stick with this new setup for now since things seem happier.

If you can think of any env variables I can set to provide more debug information, please let me know and I can give them a try and report back what I find.

@rhc54
Contributor

rhc54 commented Apr 26, 2024

Gave this some thought - given that things work fine under mpirun but fail under srun, I'm inclined to think there is some problem in the Slurm-PMIx integration when using PMIx 5.x. I know nothing about debugging Slurm, so I would really encourage you to file a ticket with SchedMD. At the very least, they should be made aware of the situation in case others encounter it.

It still feels to me like there is something else in your environment causing the problem (with the PMIx change being just a canary or a flat-out red herring), but without more info, I have no idea how to pursue it.

@bhendersonPlano
Author

One last note to add here before closing this one out and turning my focus to the Slurm/SchedMD side of the house. Two interesting things:

  1. Adding strace in front of the hello_world_mpi application hides/avoids the issue.
  2. Removing the cgroup-related options from slurm.conf also appears to hide/avoid the issue.

It turns out that I had disabled cgroups in my testing area earlier and forgotten about it. My comments above about PMIx impacting this issue should be ignored; it is much more likely that the change to the Slurm configuration in my test environment changed the launch behavior.

@rhc54
Contributor

rhc54 commented May 1, 2024

You may already know this, but be aware that SchedMD changed the srun cmd line in 23.11 - you might need to make some adjustments.

@janjust
Contributor

janjust commented May 2, 2024

@bhendersonPlano If this issue is not in OMPI but rather in Slurm or PMIx, can you please file it with the corresponding community and close it here?

@bhendersonPlano
Author

I've started a thread on the slurm-users mailing list - hopefully someone will chime in there.

I'll close this one out as it does not appear to be an Open MPI issue.
