
job launch fails with srun: [cn05] [[31923,0],1] selected pml ob1, but peer [[31923,0],0] on cn04 selected pml #12475

Open · bhendersonPlano opened this issue Apr 18, 2024 · 18 comments


@bhendersonPlano

I'm having an issue with a two-node job launch when using srun and am looking for some help. A normal mpirun, or invoking mpirun from within an salloc, works fine.

The system is a small number of dual-socket nodes running RHEL 9.3, each with an Intel E810-C card (only port zero has a connection); there are no other network cards in the system. The network connection is configured as an active-backup bond, which I know is odd, but that is how our imaging tool likes things. There is a single 100G switch and only one subnet. /home is shared across the nodes via NFS.

Software is OpenMPI (5.0.3) with user-built hwloc (2.10.0), PMIx (5.0.2), and Slurm (23.11.6).

Build options are:

./configure \
     --prefix=/home/brent/sys/openmpi/5.0.3 \
     --with-hwloc=/home/brent/sys/hwloc/2.10.0 \
     --with-slurm=/home/brent/sys/slurm/23.11.5 \
     --with-pmix=/home/brent/sys/pmix/5.0.2 \
     --disable-ipv6 \
     --enable-orterun-prefix-by-default

Slurm was also built with the same versions of hwloc and pmix.
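
A Slurm configure along these lines would match that (the install prefix is omitted and the exact invocation is an assumption, not copied from the build logs):

./configure \
     --prefix=<slurm install prefix> \
     --with-pmix=/home/brent/sys/pmix/5.0.2 \
     --with-hwloc=/home/brent/sys/hwloc/2.10.0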


A simple test works fine with mpirun directly:

$ mpirun -n 2 -H cn03,cn04 ./hello_mpi.503
Hello from rank 0 on cn03
Hello from rank 1 on cn04
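
For reference, hello_mpi.503 is just a plain MPI hello-world; the source isn't posted here, but a minimal equivalent would be:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);               /* triggers PML/BTL selection */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* rank within MPI_COMM_WORLD */
    MPI_Get_processor_name(host, &len);   /* node name, e.g. cn03 */
    printf("Hello from rank %d on %s\n", rank, host);
    MPI_Finalize();
    return 0;
}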

An mpirun from within an salloc works fine:

$ salloc -w cn03,cn04 --ntasks-per-node=1 mpirun ./hello_mpi.503 
salloc: Granted job allocation 136
salloc: Nodes cn[03-04] are ready for job
Hello from rank 0 on cn03
Hello from rank 1 on cn04
salloc: Relinquishing job allocation 136

However, an srun does not work:

[cn04:545610] [[55359,0],1] selected pml ob1, but peer [[55359,0],0] on cn03 selected pml 

srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 138.0 ON cn03 CANCELLED AT 2024-04-18T13:05:02 ***
srun: error: cn04: task 1: Exited with exit code 14
srun: Terminating StepId=138.0
srun: error: cn03: task 0: Killed

Note that on the error line, the second node is not showing what it actually selected, which seems odd. It does not matter which two nodes I pick: the first node (cn04 above) complains about the second node (cn03 above) not matching in the pml selection.

I added OMPI_MCA_btl_base_verbose=100 OMPI_MCA_pml_ob1_verbose=100 to my launches, cleaned up the output a little, and then diffed the two runs. No differences show up until the point where the error message comes out.
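
For reference, the two runs were captured roughly like this ("good" is the salloc/mpirun case, "bad" is the srun case; the exact arguments are assumed from the examples above):

$ export OMPI_MCA_btl_base_verbose=100 OMPI_MCA_pml_ob1_verbose=100
$ salloc -w cn03,cn04 --ntasks-per-node=1 mpirun ./hello_mpi.503 2>&1 | tee good
$ srun -w cn03,cn04 -N 2 -n 2 ./hello_mpi.503 2>&1 | tee bad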

$ diff good bad
26c26
< [cn04] btl:tcp: 0x904660: if bond0 kidx 6 cnt 0 addr 10.23.0.14 IPv4 bw 100000 lt 100
---
> [cn04] btl:tcp: 0x24e5830: if bond0 kidx 6 cnt 0 addr 10.23.0.14 IPv4 bw 100000 lt 100
57c57
< [cn03] btl:tcp: 0xc33660: if bond0 kidx 7 cnt 0 addr 10.23.0.13 IPv4 bw 100000 lt 100
---
> [cn03] btl:tcp: 0x1a69830: if bond0 kidx 7 cnt 0 addr 10.23.0.13 IPv4 bw 100000 lt 100
63,84c63,70
< [cn03] mca: bml: Using self btl for send to [[56142,1],0] on node cn03
< [cn04] mca: bml: Using self btl for send to [[56142,1],1] on node cn04
< Hello from rank 0 on cn03
< [cn04] mca: bml: Using tcp btl for send to [[56142,1],0] on node cn03
< [cn04] btl: tcp: attempting to connect() to [[56142,1],0] address 10.23.0.13 on port 1024
< [cn04] btl:tcp: would block, so allowing background progress
< [cn03] mca: bml: Using tcp btl for send to [[56142,1],1] on node cn04
< [cn03] btl: tcp: attempting to connect() to [[56142,1],1] address 10.23.0.14 on port 1024
< [cn03] btl:tcp: would block, so allowing background progress
< [cn03] btl:tcp: connect() to 10.23.0.14:1024 completed (complete_connect), sending connect ACK
< [cn03] btl:tcp: now connected to 10.23.0.14, process [[56142,1],1]
< [cn04] btl:tcp: connect() to 10.23.0.13:1024 completed (complete_connect), sending connect ACK
< [cn04] btl:tcp: now connected to 10.23.0.13, process [[56142,1],0]
< Hello from rank 1 on cn04
< [cn04] mca: base: close: component self closed
< [cn04] mca: base: close: unloading component self
< [cn04] mca: base: close: component tcp closed
< [cn04] mca: base: close: unloading component tcp
< [cn03] mca: base: close: component self closed
< [cn03] mca: base: close: unloading component self
< [cn03] mca: base: close: component tcp closed
< [cn03] mca: base: close: unloading component tcp
---
> [cn03] mca: bml: Using self btl for send to [[43367,0],0] on node cn03
> [cn04] [[43367,0],1] selected pml ob1, but peer [[43367,0],0] on cn03 selected pml 
> 
> slurmstepd: error: *** STEP 131.0 ON cn03 CANCELLED AT 2024-04-18T10:42:56 ***
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: cn04: task 1: Exited with exit code 14
> srun: Terminating StepId=131.0
> srun: error: cn03: task 0: Killed

I tried with OpenMPI 4.1.6 as well but ran into similar issues. I figured it would be easier to debug with 5.x, and that is where I would like to end up anyway. Any guidance on next steps would be appreciated.

@bhendersonPlano
Author

bad_run_output.txt

@rhc54
Contributor

rhc54 commented Apr 18, 2024

Looks like everything ran okay, but then OMPI tries to re-establish connections during finalize - which fails because the messaging system has already shut down.

@bosilca
Member

bosilca commented Apr 18, 2024

Both processes initialize the BTLs, so it seems obvious they both selected the OB1 PML. But somehow that information is not properly exchanged between them. More precisely, it sounds as if the two peers were in different modex collection modes, one in direct and one in full, so they take different paths in mca_pml_base_pml_check_selected. Or, and this would be the worst-case scenario, the modex is somehow corrupted.

Let's make sure they both provide the same PML name. Put a breakpoint or a printf in mca_pml_base_pml_selected and dump the process name and the PML selected (or not).
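
A sketch of such a printf in ompi/mca/pml/base/pml_base_select.c (placement and output format are only a suggestion; it needs <stdio.h> and <unistd.h>):

int
mca_pml_base_pml_selected(const char *name)
{
    /* debug only: report which PML this process is about to publish */
    printf("DEBUG: process %d is using pml %s\n", (int) getpid(),
           (NULL == name) ? "(none)" : name);
    /* ... existing body unchanged ... */
}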

@rhc54
Contributor

rhc54 commented Apr 18, 2024

@bosilca If you look carefully at his output, you can see that we successfully completed MPI_Init and outputted his "hello" message. We then enter MPI_Finalize, the various btl components close - and then we re-enter a connection attempt, which fails. So it isn't the modex or anything in the initial channel that is failing - the question is: why are we attempting to reconnect after we finalize the BTLs?

@bosilca
Member

bosilca commented Apr 18, 2024

@rhc54 that is not correct; you seem to be confusing the "correct" and the "bad" output. If you take a look at the complete output posted, you will notice the lack of application output, i.e. the hello message is nowhere to be found, so we are not yet out of MPI_Init.

@rhc54
Contributor

rhc54 commented Apr 18, 2024

I see - very confusing the way that was presented. I stand corrected.

@bhendersonPlano
Author

Sorry for being confusing - will try a different approach next time.

Looks like both processes are using the same pml.

$ srun --reservation=brent -N 2 -n 2 /home/brent/src/mpi/hello_mpi/hello_mpi.503
DEBUG: process 90471 is using pml ob1
DEBUG: process 89501 is using pml ob1
[cn10:89501] [[4419,0],1] selected pml ob1, but peer [[4419,0],0] on cn09 selected pml 

slurmstepd: error: *** STEP 166.0 ON cn09 CANCELLED AT 2024-04-18T16:34:52 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: cn09: task 0: Killed
srun: Terminating StepId=166.0
srun: error: cn10: task 1: Exited with exit code 14

@rhc54
Contributor

rhc54 commented Apr 18, 2024

FWIW: the code that exchanges the PML name, when combined with the OPAL_MODEX_foo macros to send/recv the data, probably isn't written as it should be. It treats strings as byte objects, which means you lose all the type checking PMIx provides. It would be better if one actually "put" a string, and then received a string, to ensure that things are properly matched and the strings terminated.

Given that mpirun is working correctly and you are getting identical results for both OMPI v4.1 and v5.0, I'm guessing it is something in the handoff with Slurm that is causing the problem. I note that you didn't specify --mpi=pmix on your srun cmd line - but I'm guessing you must have that locked in the Slurm config file? We know there are some issues with Slurm 23.11 (though the ones we are currently working through aren't related to this), so it could be another problem in there as well. Have you filed an issue with them?
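
For reference, the explicit form would be something like the following; srun --mpi=list shows which MPI plugins your Slurm build actually has:

$ srun --mpi=list
$ srun --mpi=pmix -w cn03,cn04 -N 2 -n 2 ./hello_mpi.503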

@bhendersonPlano
Author

bhendersonPlano commented Apr 18, 2024 via email

@rhc54
Contributor

rhc54 commented Apr 19, 2024

Please let us know if things work correctly with a different Slurm version. You might also try adding PMIX_MCA_gds=hash to your environment to see if that makes a difference - just forces use of a different key-value store mechanism in case there is some corruption in the default one.
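
For example, something like this (srun arguments assumed from the earlier runs):

$ export PMIX_MCA_gds=hash
$ srun --mpi=pmix -N 2 -n 2 ./hello_mpi.503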

If you feel like it, you could also apply the following patch to your OMPI code - it will check that the PML string we get back from the exchange is in fact a string (and hasn't been corrupted). Since the error message is printing out an empty string, it could well be that the returned value is something non-string (and we mistakenly just assume it is a string).

diff --git a/ompi/mca/pml/base/pml_base_select.c b/ompi/mca/pml/base/pml_base_select.c
index e66ec02b5e..8536dfaf7f 100644
--- a/ompi/mca/pml/base/pml_base_select.c
+++ b/ompi/mca/pml/base/pml_base_select.c
@@ -17,6 +17,7 @@
  * Copyright (c) 2020-2022 Amazon.com, Inc. or its affiliates.  All Rights
  * Copyright (c) 2018-2020 Triad National Security, LLC. All rights
  *                         reserved.
+ * Copyright (c) 2024      Nanook Consulting  All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -239,17 +240,6 @@ int mca_pml_base_select(bool enable_progress_threads,
     return ret;
 }

-/* need a "commonly" named PML structure so everything ends up in the
-   same modex field */
-static mca_base_component_t pml_base_component = {
-    OMPI_MCA_BASE_VERSION_2_1_0("pml", 2, 0, 0),
-    .mca_component_name = "base",
-    .mca_component_major_version = 2,
-    .mca_component_minor_version = 0,
-    .mca_component_release_version = 0,
-};
-
-
 /*
  * If direct modex, then publish PML for all procs. If full modex then
  * publish PML for rank 0 only. This information is used during add_procs
@@ -268,11 +258,19 @@ static mca_base_component_t pml_base_component = {
 int
 mca_pml_base_pml_selected(const char *name)
 {
-    int rc = 0;
+    int rc = OMPI_SUCCESS;
+    pmix_value_t val;

+    /* if we were told not to perform the check, then no point
+     * in unnecessarily adding to the modex payload
+     */
+    if (!ompi_pml_base_check_pml) {
+        return OMPI_SUCCESS;
+    }
     if (!opal_pmix_collect_all_data || 0 == OMPI_PROC_MY_NAME->vpid) {
-        OPAL_MODEX_SEND(rc, PMIX_GLOBAL, &pml_base_component, name,
-                        strlen(name) + 1);
+        PMIX_VALUE_LOAD(&val, name, PMIX_STRING);
+        rc = PMIx_Put(PMIX_GLOBAL, "OMPI_PML_BASE", &val);
+        PMIX_VALUE_DESTRUCT(&val);
     }
     return rc;
 }
@@ -281,9 +279,11 @@ static int
 mca_pml_base_pml_check_selected_impl(const char *my_pml,
                                      opal_process_name_t proc_name)
 {
-    size_t size;
     int ret = 0;
     char *remote_pml;
+    pmix_proc_t proc;
+    pmix_value_t *val = NULL;
+    pmix_info_t optional;

     /* if we are proc_name=OMPI_PROC_MY_NAME, then we can also assume success */
     if (0 == opal_compare_proc(ompi_proc_local()->super.proc_name, proc_name)) {
@@ -291,9 +291,13 @@ mca_pml_base_pml_check_selected_impl(const char *my_pml,
                             "check:select: PML check not necessary on self");
         return OMPI_SUCCESS;
     }
-    OPAL_MODEX_RECV_STRING(ret,
-                           mca_base_component_to_string(&pml_base_component),
-                           &proc_name, (void**) &remote_pml, &size);
+
+    /* Note that we cannot tell PMIx to just check our local modex info
+     * because we might have requested a direct modex
+     */
+    OPAL_PMIX_CONVERT_NAME(&proc, &proc_name);
+    ret = PMIx_Get(&proc, "OMPI_PML_BASE", NULL, 0, &val);
+
     if (PMIX_ERR_NOT_FOUND == ret) {
         opal_output_verbose( 10, ompi_pml_base_framework.framework_output,
                             "check:select: PML modex for process %s not found",
@@ -305,12 +309,22 @@ mca_pml_base_pml_check_selected_impl(const char *my_pml,
      * wasn't returned, but just to be safe, and since the check
      * is fast...let's be sure
      */
-    if (NULL == remote_pml) {
+    if (NULL == val) {
         opal_output_verbose( 10, ompi_pml_base_framework.framework_output,
                             "check:select: got a NULL pml from process %s",
                             OMPI_NAME_PRINT(&proc_name));
         return OMPI_ERR_UNREACH;
     }
+    if (PMIX_STRING != val->type) {
+        opal_output_verbose( 10, ompi_pml_base_framework.framework_output,
+                            "check:select: got a non-string (%s) pml from process %s",
+                            PMIx_Data_type_string(val->type), OMPI_NAME_PRINT(&proc_name));
+        PMIX_VALUE_RELEASE(val);
+        return OMPI_ERR_UNREACH;
+    }
+    remote_pml = val->data.string;
+    val->data.string = NULL;  // protect data
+    PMIX_VALUE_RELEASE(val);

     opal_output_verbose( 10, ompi_pml_base_framework.framework_output,
                         "check:select: checking my pml %s against process %s"
@@ -318,18 +332,24 @@ mca_pml_base_pml_check_selected_impl(const char *my_pml,
                         remote_pml);

     /* if that module doesn't match my own, return an error */
-    if ((size != strlen(my_pml) + 1) ||
-        (0 != strcmp(my_pml, remote_pml))) {
+    if (0 != strcmp(my_pml, remote_pml)) {
         char *errhost = NULL;
-        OPAL_MODEX_RECV_VALUE_OPTIONAL(ret, PMIX_HOSTNAME, &proc_name,
-                                       &(errhost), PMIX_STRING);
+        PMIX_INFO_LOAD(&optional, PMIX_OPTIONAL, NULL, PMIX_BOOL);
+        ret = PMIx_Get(&proc, PMIX_HOSTNAME, &optional, 1, &val);
+        if (PMIX_SUCCESS == ret && NULL != val && PMIX_STRING == val->type) {
+            errhost = val->data.string;
+            val->data.string = NULL;   // protect data
+            PMIX_VALUE_RELEASE(val);
+        }
         opal_output(0, "%s selected pml %s, but peer %s on %s selected pml %s",
                     OMPI_NAME_PRINT(&ompi_proc_local()->super.proc_name),
                     my_pml, OMPI_NAME_PRINT(&proc_name),
                     (NULL == errhost) ? "unknown" : errhost,
                     remote_pml);
         free(remote_pml);
-        free(errhost);
+        if (NULL != errhost) {
+            free(errhost);
+        }
          /* cleanup before returning */
         return OMPI_ERR_UNREACH;
     }

Note that you'll still need to put --mca pml_base_verbose 10 on your cmd line to see the error report, and this won't fix anything - it just provides a little more debug info.
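
Since the failing case is launched with srun rather than mpirun, the equivalent environment-variable form would be something like:

$ OMPI_MCA_pml_base_verbose=10 srun --mpi=pmix -N 2 -n 2 ./hello_mpi.503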

@bhendersonPlano
Author

bhendersonPlano commented Apr 19, 2024 via email

@rhc54
Contributor

rhc54 commented Apr 19, 2024

Thanks! Really appreciate the help in debugging this problem.

@bhendersonPlano
Author

No problem - thanks for the suggestions and patch. I've gotten through the first two tests on my list and have some good news.

Backing down to PMIx 4.2.9 and rebuilding Slurm 23.02.7 with it worked. (Note that I've not rebuilt OpenMPI against the older PMIx yet, so it is still using 5.0.2.)

My original setup (full details at top of thread) also worked by just setting PMIX_MCA_gds=hash, so that is interesting as well.

I'll reach out to the PMIx folks for some next steps since things seem to be pointing in that direction.

I can still try out the patch if you like - just let me know.

@rhc54
Contributor

rhc54 commented Apr 19, 2024

Hmmm...sounds like we do have a problem in the shared memory component in PMIx. Rats - those are the hardest problems to chase down and reproduce (I still cannot reproduce this problem locally). Trying out the patch would be appreciated - it would be good to know what data type the returned value was. That would tell us whether the stored object was bad or we are just missing the string.

FWIW: with Slurm using PMIx 4.2.9 and OMPI on 5.0.2, the shmem component will be disabled as the two versions' shared memory implementations are incompatible. So it effectively forces the "gds=hash" setting.

Thanks again!

@bhendersonPlano
Author

Looks like your suspicion about receiving a non-string object was correct:

[cn04] check:select: got a non-string (PMIX_BYTE_OBJECT) pml from process [[35399,0],0]

Full file here - bad_run_output_patched.txt

Let me know if there are any other runs you want or need for debugging purposes.

@rhc54
Contributor

rhc54 commented Apr 19, 2024

If you wouldn't mind, perhaps you could apply this patch to the PMIx v5.0.2 code:

diff --git a/src/mca/gds/shmem2/gds_shmem2.c b/src/mca/gds/shmem2/gds_shmem2.c
index 6f7f2ea3..11b48f39 100644
--- a/src/mca/gds/shmem2/gds_shmem2.c
+++ b/src/mca/gds/shmem2/gds_shmem2.c
@@ -1860,6 +1860,8 @@ client_update_global_keyindex_if_necessary(
     pmix_keyindex_t tmpindex;
     bool found;

+    return PMIX_SUCCESS;
+
     // Do we need to update?
     pmix_gds_shmem2_job_t *job;
     rc = pmix_gds_shmem2_get_job_tracker(nspace_name, true, &job);

It would be interesting to see if that is the problem. If so, it means that we are losing track of the key-to-index translation, and so we are returning the wrong value when asked for the PML key. The patch should be okay as long as both Slurm and OMPI are using the same PMIx code.

@bhendersonPlano
Author

I applied the patch and rebuilt PMIx. Next, I rebuilt Slurm and OpenMPI pointing at that new build. The good news is that everything worked. I kind of wish I had wrapped that 'just return success' with an env variable so I could turn it on and off easily. If I get some time later today maybe I'll do that, but for now - your patch above to the PMIx 5.0.2 tree worked great!
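
A sketch of such a toggle, in place of the unconditional return added by the patch (the variable name is made up; it needs <stdlib.h>):

    /* gate the workaround on an environment variable so it can be
     * switched on and off without rebuilding PMIx */
    if (NULL != getenv("PMIX_GDS_SHMEM2_SKIP_KEYINDEX_UPDATE")) {
        return PMIX_SUCCESS;
    }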

@rhc54
Contributor

rhc54 commented Apr 19, 2024

Ah, excellent - many thanks! I think I know what is going on, but can take that up over on the PMIx ticket. Thanks again for all the help.
