
job launch fails with srun: [cn05] [[31923,0],1] selected pml ob1, but peer [[31923,0],0] on cn04 selected pml #12475

Open · bhendersonPlano opened this issue Apr 18, 2024 · 18 comments


@bhendersonPlano

I'm having an issue with a two-node job launch when using srun and am looking for some help. A normal mpirun, or invoking mpirun from within an salloc, works fine.

The system is a small number of dual-socket nodes running RHEL 9.3, each with an Intel E810-C card (only port zero has a connection); there are no other network cards in the system. The network connection is configured as an active-backup bond, which I know is odd, but that is how our imaging tool likes things. There is a single 100G switch and only one subnet. /home is shared across the nodes via NFS.

Software is OpenMPI (5.0.3) with user-built hwloc (2.10.0), PMIx (5.0.2), and Slurm (23.11.6).

Build options are:

./configure \
     --prefix=/home/brent/sys/openmpi/5.0.3 \
     --with-hwloc=/home/brent/sys/hwloc/2.10.0 \
     --with-slurm=/home/brent/sys/slurm/23.11.5 \
     --with-pmix=/home/brent/sys/pmix/5.0.2 \
     --disable-ipv6 \
     --enable-orterun-prefix-by-default

Slurm was also built with the same versions of hwloc and pmix.
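
A Slurm configure along these lines would match that (the install prefix is omitted and the exact invocation is an assumption, not copied from the build logs):

./configure \
     --prefix=<slurm install prefix> \
     --with-pmix=/home/brent/sys/pmix/5.0.2 \
     --with-hwloc=/home/brent/sys/hwloc/2.10.0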


A simple test works fine with mpirun directly:

$ mpirun -n 2 -H cn03,cn04 ./hello_mpi.503
Hello from rank 0 on cn03
Hello from rank 1 on cn04
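
For reference, hello_mpi.503 is just a plain MPI hello-world; the source isn't posted here, but a minimal equivalent would be:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);               /* triggers PML/BTL selection */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* rank within MPI_COMM_WORLD */
    MPI_Get_processor_name(host, &len);   /* node name, e.g. cn03 */
    printf("Hello from rank %d on %s\n", rank, host);
    MPI_Finalize();
    return 0;
}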

An mpirun from within an salloc works fine:

$ salloc -w cn03,cn04 --ntasks-per-node=1 mpirun ./hello_mpi.503 
salloc: Granted job allocation 136
salloc: Nodes cn[03-04] are ready for job
Hello from rank 0 on cn03
Hello from rank 1 on cn04
salloc: Relinquishing job allocation 136

However, an srun does not work:

[cn04:545610] [[55359,0],1] selected pml ob1, but peer [[55359,0],0] on cn03 selected pml 

srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 138.0 ON cn03 CANCELLED AT 2024-04-18T13:05:02 ***
srun: error: cn04: task 1: Exited with exit code 14
srun: Terminating StepId=138.0
srun: error: cn03: task 0: Killed

Note that on the error line, the second node is not showing what it actually selected, which seems odd. It does not matter which two nodes I pick: the first node (cn04 above) complains about the second node (cn03 above) not matching in the pml selection.

I added OMPI_MCA_btl_base_verbose=100 OMPI_MCA_pml_ob1_verbose=100 to my launches, cleaned up the output a little, and then diffed the two runs. No differences show up until the point where the error message comes out.
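
For reference, the two runs were captured roughly like this ("good" is the salloc/mpirun case, "bad" is the srun case; the exact arguments are assumed from the examples above):

$ export OMPI_MCA_btl_base_verbose=100 OMPI_MCA_pml_ob1_verbose=100
$ salloc -w cn03,cn04 --ntasks-per-node=1 mpirun ./hello_mpi.503 2>&1 | tee good
$ srun -w cn03,cn04 -N 2 -n 2 ./hello_mpi.503 2>&1 | tee bad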

$ diff good bad
26c26
< [cn04] btl:tcp: 0x904660: if bond0 kidx 6 cnt 0 addr 10.23.0.14 IPv4 bw 100000 lt 100
---
> [cn04] btl:tcp: 0x24e5830: if bond0 kidx 6 cnt 0 addr 10.23.0.14 IPv4 bw 100000 lt 100
57c57
< [cn03] btl:tcp: 0xc33660: if bond0 kidx 7 cnt 0 addr 10.23.0.13 IPv4 bw 100000 lt 100
---
> [cn03] btl:tcp: 0x1a69830: if bond0 kidx 7 cnt 0 addr 10.23.0.13 IPv4 bw 100000 lt 100
63,84c63,70
< [cn03] mca: bml: Using self btl for send to [[56142,1],0] on node cn03
< [cn04] mca: bml: Using self btl for send to [[56142,1],1] on node cn04
< Hello from rank 0 on cn03
< [cn04] mca: bml: Using tcp btl for send to [[56142,1],0] on node cn03
< [cn04] btl: tcp: attempting to connect() to [[56142,1],0] address 10.23.0.13 on port 1024
< [cn04] btl:tcp: would block, so allowing background progress
< [cn03] mca: bml: Using tcp btl for send to [[56142,1],1] on node cn04
< [cn03] btl: tcp: attempting to connect() to [[56142,1],1] address 10.23.0.14 on port 1024
< [cn03] btl:tcp: would block, so allowing background progress
< [cn03] btl:tcp: connect() to 10.23.0.14:1024 completed (complete_connect), sending connect ACK
< [cn03] btl:tcp: now connected to 10.23.0.14, process [[56142,1],1]
< [cn04] btl:tcp: connect() to 10.23.0.13:1024 completed (complete_connect), sending connect ACK
< [cn04] btl:tcp: now connected to 10.23.0.13, process [[56142,1],0]
< Hello from rank 1 on cn04
< [cn04] mca: base: close: component self closed
< [cn04] mca: base: close: unloading component self
< [cn04] mca: base: close: component tcp closed
< [cn04] mca: base: close: unloading component tcp
< [cn03] mca: base: close: component self closed
< [cn03] mca: base: close: unloading component self
< [cn03] mca: base: close: component tcp closed
< [cn03] mca: base: close: unloading component tcp
---
> [cn03] mca: bml: Using self btl for send to [[43367,0],0] on node cn03
> [cn04] [[43367,0],1] selected pml ob1, but peer [[43367,0],0] on cn03 selected pml 
> 
> slurmstepd: error: *** STEP 131.0 ON cn03 CANCELLED AT 2024-04-18T10:42:56 ***
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: cn04: task 1: Exited with exit code 14
> srun: Terminating StepId=131.0
> srun: error: cn03: task 0: Killed

I tried with OpenMPI 4.1.6 as well but ran into similar issues. I figured it would be easier to debug with 5.x, and that is where I would like to end up anyway. Any guidance on next steps would be appreciated.

@bhendersonPlano
Author

bad_run_output.txt

@rhc54
Contributor

rhc54 commented Apr 18, 2024

Looks like everything ran okay, but then OMPI tries to re-establish connections during finalize - which fails because the messaging system has already shut down.

@bosilca
Member

bosilca commented Apr 18, 2024

Both processes initialize the BTLs, so it seems obvious they both selected the OB1 PML. But somehow that information is not properly exchanged between them. More precisely, it sounds as if the two peers were in different modex collection modes, one in direct and one in full, so they take different paths in mca_pml_base_pml_check_selected. Or, and this would be the worst-case scenario, the modex is somehow corrupted.

Let's make sure they both provide the same PML name. Put a breakpoint or a printf in mca_pml_base_pml_selected and dump the process name and the PML selected (or not).
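
A sketch of such a printf in ompi/mca/pml/base/pml_base_select.c (placement and output format are only a suggestion; it needs <stdio.h> and <unistd.h>):

int
mca_pml_base_pml_selected(const char *name)
{
    /* debug only: report which PML this process is about to publish */
    printf("DEBUG: process %d is using pml %s\n", (int) getpid(),
           (NULL == name) ? "(none)" : name);
    /* ... existing body unchanged ... */
}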

@rhc54
Contributor

rhc54 commented Apr 18, 2024

@bosilca If you look carefully at his output, you can see that we successfully completed MPI_Init and outputted his "hello" message. We then enter MPI_Finalize, the various btl components close - and then we re-enter a connection attempt, which fails. So it isn't the modex or anything in the initial channel that is failing - the question is: why are we attempting to reconnect after we finalize the BTLs?

@bosilca
Member

bosilca commented Apr 18, 2024

@rhc54 that is not correct; you seem to be confusing the "correct" and the "bad" output. If you take a look at the complete output posted, you will notice the lack of application output, i.e. the hello message is nowhere to be found, so we are not yet out of MPI_Init.

@rhc54
Contributor

rhc54 commented Apr 18, 2024

I see - very confusing the way that was presented. I stand corrected.

@bhendersonPlano
Author

Sorry for being confusing - will try a different approach next time.

Looks like both processes are using the same pml.

$ srun --reservation=brent -N 2 -n 2 /home/brent/src/mpi/hello_mpi/hello_mpi.503
DEBUG: process 90471 is using pml ob1
DEBUG: process 89501 is using pml ob1
[cn10:89501] [[4419,0],1] selected pml ob1, but peer [[4419,0],0] on cn09 selected pml 

slurmstepd: error: *** STEP 166.0 ON cn09 CANCELLED AT 2024-04-18T16:34:52 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: cn09: task 0: Killed
srun: Terminating StepId=166.0
srun: error: cn10: task 1: Exited with exit code 14

@rhc54
Contributor

rhc54 commented Apr 18, 2024

FWIW: the code that exchanges the PML name, when combined with the OPAL_MODEX_foo macros to send/recv the data, probably isn't written as it should be. It treats strings as byte objects, which means you lose all the type checking PMIx provides. It would be better if one actually "put" a string, and then received a string, to ensure that things are properly matched and the strings terminated.

Given that mpirun is working correctly and you are getting identical results for both OMPI v4.1 and v5.0, I'm guessing it is something in the handoff with Slurm that is causing the problem. I note that you didn't specify --mpi=pmix on your srun cmd line - but I'm guessing you must have that locked in the Slurm config file? We know there are some issues with Slurm 23.11 (though the ones we are currently working through aren't related to this), so it could be another problem in there as well. Have you filed an issue with them?
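
For reference, the explicit form would be something like the following; srun --mpi=list shows which MPI plugins your Slurm build actually has:

$ srun --mpi=list
$ srun --mpi=pmix -w cn03,cn04 -N 2 -n 2 ./hello_mpi.503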

@bhendersonPlano
Author

bhendersonPlano commented Apr 18, 2024 via email

@rhc54
Contributor

rhc54 commented Apr 19, 2024

Please let us know if things work correctly with a different Slurm version. You might also try adding PMIX_MCA_gds=hash to your environment to see if that makes a difference - just forces use of a different key-value store mechanism in case there is some corruption in the default one.
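
For example, something like this (srun arguments assumed from the earlier runs):

$ export PMIX_MCA_gds=hash
$ srun --mpi=pmix -N 2 -n 2 ./hello_mpi.503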

If you feel like it, you could also apply the following patch to your OMPI code - it will check that the PML string we get back from the exchange is in fact a string (and hasn't been corrupted). Since the error message is printing out an empty string, it could well be that the returned value is something non-string (and we mistakenly just assume it is a string).

diff --git a/ompi/mca/pml/base/pml_base_select.c b/ompi/mca/pml/base/pml_base_select.c
index e66ec02b5e..8536dfaf7f 100644
--- a/ompi/mca/pml/base/pml_base_select.c
+++ b/ompi/mca/pml/base/pml_base_select.c
@@ -17,6 +17,7 @@
  * Copyright (c) 2020-2022 Amazon.com, Inc. or its affiliates.  All Rights
  * Copyright (c) 2018-2020 Triad National Security, LLC. All rights
  *                         reserved.
+ * Copyright (c) 2024      Nanook Consulting  All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -239,17 +240,6 @@ int mca_pml_base_select(bool enable_progress_threads,
     return ret;
 }

-/* need a "commonly" named PML structure so everything ends up in the
-   same modex field */
-static mca_base_component_t pml_base_component = {
-    OMPI_MCA_BASE_VERSION_2_1_0("pml", 2, 0, 0),
-    .mca_component_name = "base",
-    .mca_component_major_version = 2,
-    .mca_component_minor_version = 0,
-    .mca_component_release_version = 0,
-};
-
-
 /*
  * If direct modex, then publish PML for all procs. If full modex then
  * publish PML for rank 0 only. This information is used during add_procs
@@ -268,11 +258,19 @@ static mca_base_component_t pml_base_component = {
 int
 mca_pml_base_pml_selected(const char *name)
 {
-    int rc = 0;
+    int rc = OMPI_SUCCESS;
+    pmix_value_t val;

+    /* if we were told not to perform the check, then no point
+     * in unnecessarily adding to the modex payload
+     */
+    if (!ompi_pml_base_check_pml) {
+        return OMPI_SUCCESS;
+    }
     if (!opal_pmix_collect_all_data || 0 == OMPI_PROC_MY_NAME->vpid) {
-        OPAL_MODEX_SEND(rc, PMIX_GLOBAL, &pml_base_component, name,
-                        strlen(name) + 1);
+        PMIX_VALUE_LOAD(&val, name, PMIX_STRING);
+        rc = PMIx_Put(PMIX_GLOBAL, "OMPI_PML_BASE", &val);
+        PMIX_VALUE_DESTRUCT(&val);
     }
     return rc;
 }
@@ -281,9 +279,11 @@ static int
 mca_pml_base_pml_check_selected_impl(const char *my_pml,
                                      opal_process_name_t proc_name)
 {
-    size_t size;
     int ret = 0;
     char *remote_pml;
+    pmix_proc_t proc;
+    pmix_value_t *val = NULL;
+    pmix_info_t optional;

     /* if we are proc_name=OMPI_PROC_MY_NAME, then we can also assume success */
     if (0 == opal_compare_proc(ompi_proc_local()->super.proc_name, proc_name)) {
@@ -291,9 +291,13 @@ mca_pml_base_pml_check_selected_impl(const char *my_pml,
                             "check:select: PML check not necessary on self");
         return OMPI_SUCCESS;
     }
-    OPAL_MODEX_RECV_STRING(ret,
-                           mca_base_component_to_string(&pml_base_component),
-                           &proc_name, (void**) &remote_pml, &size);
+
+    /* Note that we cannot tell PMIx to just check our local modex info
+     * because we might have requested a direct modex
+     */
+    OPAL_PMIX_CONVERT_NAME(&proc, &proc_name);
+    ret = PMIx_Get(&proc, "OMPI_PML_BASE", NULL, 0, &val);
+
     if (PMIX_ERR_NOT_FOUND == ret) {
         opal_output_verbose( 10, ompi_pml_base_framework.framework_output,
                             "check:select: PML modex for process %s not found",
@@ -305,12 +309,22 @@ mca_pml_base_pml_check_selected_impl(const char *my_pml,
      * wasn't returned, but just to be safe, and since the check
      * is fast...let's be sure
      */
-    if (NULL == remote_pml) {
+    if (NULL == val) {
         opal_output_verbose( 10, ompi_pml_base_framework.framework_output,
                             "check:select: got a NULL pml from process %s",
                             OMPI_NAME_PRINT(&proc_name));
         return OMPI_ERR_UNREACH;
     }
+    if (PMIX_STRING != val->type) {
+        opal_output_verbose( 10, ompi_pml_base_framework.framework_output,
+                            "check:select: got a non-string (%s) pml from process %s",
+                            PMIx_Data_type_string(val->type), OMPI_NAME_PRINT(&proc_name));
+        PMIX_VALUE_RELEASE(val);
+        return OMPI_ERR_UNREACH;
+    }
+    remote_pml = val->data.string;
+    val->data.string = NULL;  // protect data
+    PMIX_VALUE_RELEASE(val);

     opal_output_verbose( 10, ompi_pml_base_framework.framework_output,
                         "check:select: checking my pml %s against process %s"
@@ -318,18 +332,24 @@ mca_pml_base_pml_check_selected_impl(const char *my_pml,
                         remote_pml);

     /* if that module doesn't match my own, return an error */
-    if ((size != strlen(my_pml) + 1) ||
-        (0 != strcmp(my_pml, remote_pml))) {
+    if (0 != strcmp(my_pml, remote_pml)) {
         char *errhost = NULL;
-        OPAL_MODEX_RECV_VALUE_OPTIONAL(ret, PMIX_HOSTNAME, &proc_name,
-                                       &(errhost), PMIX_STRING);
+        PMIX_INFO_LOAD(&optional, PMIX_OPTIONAL, NULL, PMIX_BOOL);
+        ret = PMIx_Get(&proc, PMIX_HOSTNAME, &optional, 1, &val);
+        if (PMIX_SUCCESS == ret && NULL != val && PMIX_STRING == val->type) {
+            errhost = val->data.string;
+            val->data.string = NULL;   // protect data
+            PMIX_VALUE_RELEASE(val);
+        }
         opal_output(0, "%s selected pml %s, but peer %s on %s selected pml %s",
                     OMPI_NAME_PRINT(&ompi_proc_local()->super.proc_name),
                     my_pml, OMPI_NAME_PRINT(&proc_name),
                     (NULL == errhost) ? "unknown" : errhost,
                     remote_pml);
         free(remote_pml);
-        free(errhost);
+        if (NULL != errhost) {
+            free(errhost);
+        }
          /* cleanup before returning */
         return OMPI_ERR_UNREACH;
     }

Note that you'll still need to put --mca pml_base_verbose 10 on your cmd line to see the error report, and this won't fix anything - it just provides a little more debug info.
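
Since the failing case is launched with srun rather than mpirun, the equivalent environment-variable form would be something like:

$ OMPI_MCA_pml_base_verbose=10 srun --mpi=pmix -N 2 -n 2 ./hello_mpi.503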

@bhendersonPlano
Author

bhendersonPlano commented Apr 19, 2024 via email

@rhc54
Contributor

rhc54 commented Apr 19, 2024

Thanks! Really appreciate the help in debugging this problem.

@bhendersonPlano
Author

No problem - thanks for the suggestions and patch. I've gotten through the first two tests on my list and have some good news.

Backing down to PMIx 4.2.9 and rebuilding Slurm 23.02.7 with it worked. (Note that I've not rebuilt OpenMPI against the older PMIx yet, so it is still using 5.0.2.)

My original setup (full details at top of thread) also worked by just setting PMIX_MCA_gds=hash, so that is interesting as well.

I'll reach out to the PMIx folks for some next steps since things seem to be pointing in that direction.

I can still try out the patch if you like - just let me know.

@rhc54
Contributor

rhc54 commented Apr 19, 2024

Hmmm...sounds like we do have a problem in the shared memory component in PMIx. Rats - those are the hardest problems to chase down and reproduce (I still cannot reproduce this problem locally). Trying out the patch would be appreciated - it would be good to know what data type the returned value was. That would tell us whether the stored object was bad or we are just missing the string.

FWIW: with Slurm using PMIx 4.2.9 and OMPI on 5.0.2, the shmem component will be disabled as the two versions' shared memory implementations are incompatible. So it effectively forces the "gds=hash" setting.

Thanks again!

@bhendersonPlano
Author

Looks like your suspicion about receiving a non-string object was correct:

[cn04] check:select: got a non-string (PMIX_BYTE_OBJECT) pml from process [[35399,0],0]

Full file here - bad_run_output_patched.txt

Let me know if there are any other runs you want or need for debugging purposes.

@rhc54
Contributor

rhc54 commented Apr 19, 2024

If you wouldn't mind, perhaps you could apply this patch to the PMIx v5.0.2 code:

diff --git a/src/mca/gds/shmem2/gds_shmem2.c b/src/mca/gds/shmem2/gds_shmem2.c
index 6f7f2ea3..11b48f39 100644
--- a/src/mca/gds/shmem2/gds_shmem2.c
+++ b/src/mca/gds/shmem2/gds_shmem2.c
@@ -1860,6 +1860,8 @@ client_update_global_keyindex_if_necessary(
     pmix_keyindex_t tmpindex;
     bool found;

+    return PMIX_SUCCESS;
+
     // Do we need to update?
     pmix_gds_shmem2_job_t *job;
     rc = pmix_gds_shmem2_get_job_tracker(nspace_name, true, &job);

It would be interesting to see if that is the problem. If so, it means that we are losing track of the key-to-index translation, and so we are returning the wrong value when asked for the PML key. The patch should be okay as long as both Slurm and OMPI are using the same PMIx code.

@bhendersonPlano
Author

I applied the patch and rebuilt PMIx. Next, I rebuilt Slurm and OpenMPI pointing at that new build. The good news is that everything worked. I kind of wish I had wrapped that 'just return success' with an env variable so I could turn it on and off easily. If I get some time later today maybe I'll do that, but for now - your patch above to the PMIx 5.0.2 tree worked great!
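
A sketch of such a toggle, in place of the unconditional return added by the patch (the variable name is made up; it needs <stdlib.h>):

    /* gate the workaround on an environment variable so it can be
     * switched on and off without rebuilding PMIx */
    if (NULL != getenv("PMIX_GDS_SHMEM2_SKIP_KEYINDEX_UPDATE")) {
        return PMIX_SUCCESS;
    }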

@rhc54
Contributor

rhc54 commented Apr 19, 2024

Ah, excellent - many thanks! I think I know what is going on, but can take that up over on the PMIx ticket. Thanks again for all the help.
