job launch fails with srun: [cn05] [[31923,0],1] selected pml ob1, but peer [[31923,0],0] on cn04 selected pml #12475
Comments
Looks like everything ran okay, but then OMPI tries to re-establish connections during finalize - which fails because the messaging system has already shut down. |
Both processes initialize the BTLs, so it seems obvious they both selected the OB1 PML. But somehow that information is not properly exchanged between them. More precisely, it sounds as if the two peers were in different modex collection modes, one in direct and one in full, so they take different paths through mca_pml_base_pml_check_selected(). Let's make sure they both provide the same PML name. Put a breakpoint or a printf in that function. |
@bosilca If you look carefully at his output, you can see that we successfully completed that step. |
@rhc54 that is not correct - you seem to be confusing the "correct" and the "bad" output. If you take a look at the complete output posted, you will notice the lack of application output, i.e. the hello message is nowhere to be found, so we are not yet out of the initialization phase. |
I see - very confusing the way that was presented. I stand corrected. |
Sorry for being confusing - I'll try a different approach next time. Looks like it is using the same PML on both processes.
|
FWIW: the code that exchanges the PML name, when combined with the OPAL_MODEX_foo macros to send/recv the data, probably isn't written as it should be. It treats strings as byte objects, which means you lose all the type checking PMIx provides. It would be better if one actually "put" a string, and then received a string, to ensure that things are properly matched and the strings terminated. Given that mpirun is working correctly and you are getting identical results for both OMPI v4.1 and v5.0, I'm guessing it is something in the handoff with Slurm that is causing the problem. I note that you didn't specify --mpi=pmix on your srun cmd line - but I'm guessing you must have that locked in the Slurm config file? We know there are some issues with Slurm 23.11 (though the ones we are currently working thru aren't related to this), so it could be another problem in there as well. Have you filed an issue with them? |
Yep, we have 'MpiDefault=pmix' in the slurm.conf file so I didn't need it
on the srun command line. Good suggestion about the slurm 23.11.x build.
I may drop back a major revision and see if things 'just work' before
opening an issue with the schedmd folks. Thanks for the help.
|
Please let us know if things work correctly with a different Slurm version. You might also try adding PMIX_MCA_gds=hash to your environment to see if that makes a difference - it just forces use of a different key-value store mechanism, in case there is some corruption in the default one.
If you feel like it, you could also apply the following patch to your OMPI code - it will check that the PML string we get back from the exchange is in fact a string (and hasn't been corrupted). Since the error message is printing out an empty string, it could well be that the returned value is something non-string (and we mistakenly just assume it is a string).
diff --git a/ompi/mca/pml/base/pml_base_select.c b/ompi/mca/pml/base/pml_base_select.c
index e66ec02b5e..8536dfaf7f 100644
--- a/ompi/mca/pml/base/pml_base_select.c
+++ b/ompi/mca/pml/base/pml_base_select.c
@@ -17,6 +17,7 @@
* Copyright (c) 2020-2022 Amazon.com, Inc. or its affiliates. All Rights
* Copyright (c) 2018-2020 Triad National Security, LLC. All rights
* reserved.
+ * Copyright (c) 2024 Nanook Consulting All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
@@ -239,17 +240,6 @@ int mca_pml_base_select(bool enable_progress_threads,
return ret;
}
-/* need a "commonly" named PML structure so everything ends up in the
- same modex field */
-static mca_base_component_t pml_base_component = {
- OMPI_MCA_BASE_VERSION_2_1_0("pml", 2, 0, 0),
- .mca_component_name = "base",
- .mca_component_major_version = 2,
- .mca_component_minor_version = 0,
- .mca_component_release_version = 0,
-};
-
-
/*
* If direct modex, then publish PML for all procs. If full modex then
* publish PML for rank 0 only. This information is used during add_procs
@@ -268,11 +258,19 @@ static mca_base_component_t pml_base_component = {
int
mca_pml_base_pml_selected(const char *name)
{
- int rc = 0;
+ int rc = OMPI_SUCCESS;
+ pmix_value_t val;
+ /* if we were told not to perform the check, then no point
+ * in unnecessarily adding to the modex payload
+ */
+ if (!ompi_pml_base_check_pml) {
+ return OMPI_SUCCESS;
+ }
if (!opal_pmix_collect_all_data || 0 == OMPI_PROC_MY_NAME->vpid) {
- OPAL_MODEX_SEND(rc, PMIX_GLOBAL, &pml_base_component, name,
- strlen(name) + 1);
+ PMIX_VALUE_LOAD(&val, name, PMIX_STRING);
+ rc = PMIx_Put(PMIX_GLOBAL, "OMPI_PML_BASE", &val);
+ PMIX_VALUE_DESTRUCT(&val);
}
return rc;
}
@@ -281,9 +279,11 @@ static int
mca_pml_base_pml_check_selected_impl(const char *my_pml,
opal_process_name_t proc_name)
{
- size_t size;
int ret = 0;
char *remote_pml;
+ pmix_proc_t proc;
+ pmix_value_t *val = NULL;
+ pmix_info_t optional;
/* if we are proc_name=OMPI_PROC_MY_NAME, then we can also assume success */
if (0 == opal_compare_proc(ompi_proc_local()->super.proc_name, proc_name)) {
@@ -291,9 +291,13 @@ mca_pml_base_pml_check_selected_impl(const char *my_pml,
"check:select: PML check not necessary on self");
return OMPI_SUCCESS;
}
- OPAL_MODEX_RECV_STRING(ret,
- mca_base_component_to_string(&pml_base_component),
- &proc_name, (void**) &remote_pml, &size);
+
+ /* Note that we cannot tell PMIx to just check our local modex info
+ * because we might have requested a direct modex
+ */
+ OPAL_PMIX_CONVERT_NAME(&proc, &proc_name);
+ ret = PMIx_Get(&proc, "OMPI_PML_BASE", NULL, 0, &val);
+
if (PMIX_ERR_NOT_FOUND == ret) {
opal_output_verbose( 10, ompi_pml_base_framework.framework_output,
"check:select: PML modex for process %s not found",
@@ -305,12 +309,22 @@ mca_pml_base_pml_check_selected_impl(const char *my_pml,
* wasn't returned, but just to be safe, and since the check
* is fast...let's be sure
*/
- if (NULL == remote_pml) {
+ if (NULL == val) {
opal_output_verbose( 10, ompi_pml_base_framework.framework_output,
"check:select: got a NULL pml from process %s",
OMPI_NAME_PRINT(&proc_name));
return OMPI_ERR_UNREACH;
}
+ if (PMIX_STRING != val->type) {
+ opal_output_verbose( 10, ompi_pml_base_framework.framework_output,
+ "check:select: got a non-string (%s) pml from process %s",
+ PMIx_Data_type_string(val->type), OMPI_NAME_PRINT(&proc_name));
+ PMIX_VALUE_RELEASE(val);
+ return OMPI_ERR_UNREACH;
+ }
+ remote_pml = val->data.string;
+ val->data.string = NULL; // protect data
+ PMIX_VALUE_RELEASE(val);
opal_output_verbose( 10, ompi_pml_base_framework.framework_output,
"check:select: checking my pml %s against process %s"
@@ -318,18 +332,24 @@ mca_pml_base_pml_check_selected_impl(const char *my_pml,
remote_pml);
/* if that module doesn't match my own, return an error */
- if ((size != strlen(my_pml) + 1) ||
- (0 != strcmp(my_pml, remote_pml))) {
+ if (0 != strcmp(my_pml, remote_pml)) {
char *errhost = NULL;
- OPAL_MODEX_RECV_VALUE_OPTIONAL(ret, PMIX_HOSTNAME, &proc_name,
- &(errhost), PMIX_STRING);
+ PMIX_INFO_LOAD(&optional, PMIX_OPTIONAL, NULL, PMIX_BOOL);
+ ret = PMIx_Get(&proc, PMIX_HOSTNAME, &optional, 1, &val);
+ if (PMIX_SUCCESS == ret && NULL != val && PMIX_STRING == val->type) {
+ errhost = val->data.string;
+ val->data.string = NULL; // protect data
+ PMIX_VALUE_RELEASE(val);
+ }
opal_output(0, "%s selected pml %s, but peer %s on %s selected pml %s",
OMPI_NAME_PRINT(&ompi_proc_local()->super.proc_name),
my_pml, OMPI_NAME_PRINT(&proc_name),
(NULL == errhost) ? "unknown" : errhost,
remote_pml);
free(remote_pml);
- free(errhost);
+ if (NULL != errhost) {
+ free(errhost);
+ }
/* cleanup before returning */
return OMPI_ERR_UNREACH;
}
Note that you'll still need to put --mca pml_base_verbose 10 on your cmd line to see the error report, and this won't fix anything - it just provides a little more debug info. |
Thanks Ralph - I backed down to slurm 23.02.7 (with the same hwloc and
pmix) and saw the same issue.
My next experiment is to back down PMIx from 5.0.2 to 4.2.9. Next on the
list is the PMIX_MCA_gds=hash setting, and then the patch above with
pml_base_verbose enabled.
I'll keep you posted on what I find out.
|
Thanks! Really appreciate the help in debugging this problem. |
No problem - thanks for the suggestions and patch. I've gotten through the first two tests on my list and have some good news. Backing down to PMIx 4.2.9 and rebuilding Slurm 23.02.7 with it worked. (Note that I've not rebuilt OpenMPI against the older PMIx yet, so it is still on 5.0.2.) My original setup (full details at top of thread) also worked by just setting PMIX_MCA_gds=hash, so that is interesting as well. I'll reach out to the PMIx folks for next steps since things seem to be pointing in that direction. I can still try out the patch if you like - just let me know. |
Hmmm...sounds like we do have a problem in the shared memory component in PMIx. Rats - those are the hardest problems to chase down and the hardest to reproduce (I still cannot reproduce this problem locally). Trying out the patch would be appreciated - it would be good to know what data type the returned value was. That would tell us if the stored object was bad, or if we are just missing the string. FWIW: with Slurm using PMIx 4.2.9 and OMPI on 5.0.2, the shmem component will be disabled because the two versions' shared memory implementations are incompatible. So it effectively forces the "gds=hash" setting. Thanks again! |
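For reference, the workaround discussed in this thread amounts to the following. This is a sketch only - ./hello stands in for whatever MPI test binary you use, and it assumes Slurm's pmix plugin is available:

```shell
# Force PMIx's hash key-value store instead of the shared-memory (shmem2)
# component, which is suspected of corrupting the stored PML value.
export PMIX_MCA_gds=hash
srun --mpi=pmix -N 2 ./hello
```

If MpiDefault=pmix is set in slurm.conf (as it was for the reporter), the --mpi=pmix flag can be omitted.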
Looks like your suspicion was correct about receiving a non-string object.
Full file here - bad_run_output_patched.txt. Let me know if there are any other runs you want/need for debugging purposes. |
If you wouldn't mind, perhaps you could apply this patch to the PMIx v5.0.2 code: diff --git a/src/mca/gds/shmem2/gds_shmem2.c b/src/mca/gds/shmem2/gds_shmem2.c
index 6f7f2ea3..11b48f39 100644
--- a/src/mca/gds/shmem2/gds_shmem2.c
+++ b/src/mca/gds/shmem2/gds_shmem2.c
@@ -1860,6 +1860,8 @@ client_update_global_keyindex_if_necessary(
pmix_keyindex_t tmpindex;
bool found;
+ return PMIX_SUCCESS;
+
// Do we need to update?
pmix_gds_shmem2_job_t *job;
rc = pmix_gds_shmem2_get_job_tracker(nspace_name, true, &job);
It would be interesting to see if that is the problem. If so, it means that we are losing track of the key-to-index translation, and so we are returning the wrong value when asked for the PML key. The patch should be okay as long as both Slurm and OMPI are using the same PMIx code. |
I applied the patch and rebuilt PMIx, then rebuilt Slurm and OpenMPI pointing at that new build. Good news: everything worked. I kind of wish I had wrapped that 'just return success' in an environment variable so I could turn it on and off easily. If I get some time later today maybe I'll do that, but for now - your patch above to the PMIx 5.0.2 tree worked great! |
Ah, excellent - many thanks! I think I know what is going on, but can take that up over on the PMIx ticket. Thanks again for all the help. |
I'm having an issue with a two-node job launch when using srun and am looking for some help. A normal mpirun works fine, as does invoking mpirun from within an salloc.
The system is a small number of dual-socket nodes running RHEL 9.3, each with an Intel E810-C card (only port zero has a connection); there are no other network cards in the system. The network connection is configured as an active-backup bond, which I know is odd, but that is how our imaging tool likes things. There is a single 100G switch and only one subnet. /home is shared across the nodes via NFS.
Software is OpenMPI (5.0.3) with user-built hwloc (2.10.0), PMIx (5.0.2) and Slurm (23.11.6).
Build options are:
Slurm was also built with the same versions of hwloc and pmix.
A simple test works fine with mpirun directly:
An mpirun from within an salloc works fine:
However, an srun does not work:
Note that on the error line, the second node is not showing what PML it selected - which seems odd. It does not matter which two nodes I pick; the first node (cn04 above) complains about the second node (cn03 above) not matching in the PML selection.
I added OMPI_MCA_btl_base_verbose=100 OMPI_MCA_pml_ob1_verbose=100 to my launch, cleaned up the output a little, and then diffed the two runs. No differences showed up before the point where the error message appears.
I tried with OpenMPI 4.1.6 as well but ran into similar issues. I figured it would be easier to debug with 5.x, and that is where I would like to end up anyway. Any guidance on next steps would be appreciated.