frozen ring_c sanity test on ubuntu 22.04 laptop, not cloud VM #12448

Open
richardbeare opened this issue Mar 31, 2024 · 16 comments

@richardbeare

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

OpenMPI_4.1.4-GCC-12.2.0

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

This installation is part of the foss-2022b easybuild source package

Please describe the system on which you are running

Ubuntu 22.04. One cloud VM, one laptop (Dell). No networking; purely local tests.

Details of the problem

I am testing EasyBuild on two Ubuntu 22.04 systems: one laptop, one cloud VM. The install and test process works as expected on the cloud VM. On the laptop, the ring_c test hangs with 8 processes each running at 100% load.

OMPI_MCA_rmaps_base_oversubscribe=1 /slowdata/richardb/easybuild/software/OpenMPI/4.1.4-GCC-12.2.0/bin/mpirun --verbose -n 8 /slowdata/richardb/easybuild/build/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_ring_c
Process 0 sending 10 to 1, tag 201 (8 processes in ring)
Process 0 sent to 1

#8305 points out that modifying the command as follows fixes the issue:

OMPI_MCA_rmaps_base_oversubscribe=1 /slowdata/richardb/easybuild/software/OpenMPI/4.1.4-GCC-12.2.0/bin/mpirun --verbose --mca mtl_ofi_provider_include sockets -n 8 /slowdata/richardb/easybuild/build/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_ring_c
Process 0 sending 10 to 1, tag 201 (8 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 3 exiting
Process 4 exiting
Process 5 exiting
Process 6 exiting
Process 1 exiting
Process 2 exiting
Process 7 exiting

I haven't been able to figure out what causes the differences in default configuration between the systems.
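
To make a hang like this easier to reproduce and report, the command can be run under a timeout with its output captured. A minimal Python sketch: `run_with_timeout` and its `grace` parameter are illustrative names, not part of Open MPI, and the sketch assumes the launcher cleans up its child processes when it receives SIGTERM.

```python
import subprocess


def run_with_timeout(argv, seconds, grace=5.0):
    """Run argv and return (finished, output).

    finished is False when the command was still running after `seconds`
    and had to be stopped, which is how a hang shows up here.
    """
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr with stdout
        text=True,
    )
    try:
        out, _ = proc.communicate(timeout=seconds)
        return True, out
    except subprocess.TimeoutExpired:
        # Ask politely first (assumption: the launcher tears down its
        # child processes on SIGTERM), then force-kill if it lingers.
        proc.terminate()
        try:
            out, _ = proc.communicate(timeout=grace)
        except subprocess.TimeoutExpired:
            proc.kill()
            out, _ = proc.communicate()
        return False, out
```

It could be called with the same argument list as the hanging command above, optionally adding `--mca pml_base_verbose 10`, `--mca mtl_base_verbose 10` and `--mca btl_base_verbose 10` so that the captured output shows which components Open MPI selects on each machine.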

@bosilca
Member

bosilca commented Apr 1, 2024

Do you need to run with the OFI provider? The problem with the workaround is that your local communications now use sockets instead of shared memory, leading to a significant performance penalty.
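
As a rough illustration of keeping local traffic on shared memory, the sketch below builds `mpirun` argument lists that select components explicitly. The helper name `mpirun_argv` and the two dictionaries are hypothetical. `ob1`, `self`, `vader` and `ofi` are components listed by `ompi_info` for this build, but whether either selection avoids the hang on the laptop is untested here.

```python
def mpirun_argv(mpirun, nprocs, program, mca=None):
    """Build an mpirun argument list with optional '--mca name value' pairs."""
    argv = [mpirun, "-n", str(nprocs)]
    for name, value in (mca or {}).items():
        argv += ["--mca", name, value]
    argv.append(program)
    return argv


# Candidate 1: use the ob1 PML with the self and vader (shared memory) BTLs.
SHARED_MEMORY = {"pml": "ob1", "btl": "self,vader"}

# Candidate 2: only exclude the OFI MTL and let Open MPI choose the rest.
NO_OFI_MTL = {"mtl": "^ofi"}
```

The resulting list can be passed to `subprocess.run`, or joined with spaces and pasted into a shell.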

@richardbeare
Author

I assume shared memory communication is the default? How do I figure out why it works differently on two machines running supposedly identical OSes?

@bosilca
Member

bosilca commented Apr 2, 2024

I don't think the OFI provider can handle two types of network simultaneously. I heard there is an ongoing effort to do so, but I'm not sure if it is part of any release. @hppritcha might have more info.

@hppritcha
Member

I think the OFI problem in #8305 is misleading you. Can you post the output of ompi_info from both the laptop and cloud VM install?

@richardbeare
Author

I noticed that the ordering of some entries is different, but couldn't see anything beyond that.
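
Because the two listings differ in component order and install prefix, a plain diff is noisy. Here is a rough Python sketch of an order-insensitive comparison, assuming each `ompi_info` output has been saved as text. The function names and the `<ROOT>` placeholder are made up for illustration, and lines are split on the first `": "`, which is a guess at the layout rather than a documented format.

```python
from collections import defaultdict


def parse_ompi_info(text, root=None):
    """Map each 'key: value' label to the sorted list of its values.

    `root` is an optional install prefix to mask out, so that two builds
    installed under different directories compare equal.
    """
    entries = defaultdict(list)
    last_key = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if root:
            line = line.replace(root, "<ROOT>")
        key, sep, value = line.partition(": ")
        if sep:
            entries[key].append(value.strip())
            last_key = key
        elif last_key is not None:
            # No 'key: ' part: treat as a wrapped continuation of the
            # previous value (e.g. a long configure command line).
            entries[last_key][-1] += " " + line
    return {k: sorted(v) for k, v in entries.items()}


def diff_ompi_info(a, b):
    """Return {key: (values_in_a, values_in_b)} for keys whose values differ."""
    return {
        k: (a.get(k, []), b.get(k, []))
        for k in sorted(set(a) | set(b))
        if a.get(k, []) != b.get(k, [])
    }
```

Fields such as `Configured on`, `Configured by` and `Built host` are expected to differ; anything else that shows up is worth a closer look. Identical output would only show that the two builds match. It says nothing about which components are selected at run time on each machine.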

Laptop

                 Package: Open MPI richardb@duvel Distribution
                Open MPI: 4.1.4
  Open MPI repo revision: v4.1.4
   Open MPI release date: May 26, 2022
                Open RTE: 4.1.4
  Open RTE repo revision: v4.1.4
   Open RTE release date: May 26, 2022
                    OPAL: 4.1.4
      OPAL repo revision: v4.1.4
       OPAL release date: May 26, 2022
                 MPI API: 3.1.0
            Ident string: 4.1.4
                  Prefix: /slowdata/richardb/easybuild/software/OpenMPI/4.1.4-GCC-12.2.0
 Configured architecture: x86_64-pc-linux-gnu
          Configure host: duvel
           Configured by: richardb
           Configured on: Sat Mar 30 11:19:09 UTC 2024
          Configure host: duvel
  Configure command line: '--prefix=/slowdata/richardb/easybuild/software/OpenMPI/4.1.4-GCC-12.2.0' '--build=x86_64-pc-linux-gnu' '--host=x86_64-pc-linux-gnu' '--with-cuda=internal' '--enable-mpirun-prefix-by-default' '--enable-shared' '--with-hwloc=/slowdata/richardb/easybuild/software/hwloc/2.8.0-GCCcore-12.2.0' '--with-libevent=/slowdata/richardb/easybuild/software/libevent/2.1.12-GCCcore-12.2.0' '--with-ofi=/slowdata/richardb/easybuild/software/libfabric/1.16.1-GCCcore-12.2.0' '--with-pmix=/slowdata/richardb/easybuild/software/PMIx/4.2.2-GCCcore-12.2.0' '--with-ucx=/slowdata/richardb/easybuild/software/UCX/1.13.1-GCCcore-12.2.0' '--with-ucc=/slowdata/richardb/easybuild/software/UCC/1.1.0-GCCcore-12.2.0' '--without-verbs' 'build_alias=x86_64-pc-linux-gnu' 'host_alias=x86_64-pc-linux-gnu' 'CC=gcc' 'CFLAGS=-O2 -ftree-vectorize -march=native -fno-math-errno' 'LDFLAGS=-L/slowdata/richardb/easybuild/software/UCC/1.1.0-GCCcore-12.2.0/lib64 -L/slowdata/richardb/easybuild/software/UCC/1.1.0-GCCcore-12.2.0/lib -L/slowdata/richardb/easybuild/software/PMIx/4.2.2-GCCcore-12.2.0/lib64 -L/slowdata/richardb/easybuild/software/PMIx/4.2.2-GCCcore-12.2.0/lib -L/slowdata/richardb/easybuild/software/libfabric/1.16.1-GCCcore-12.2.0/lib64 -L/slowdata/richardb/easybuild/software/libfabric/1.16.1-GCCcore-12.2.0/lib -L/slowdata/richardb/easybuild/software/UCX/1.13.1-GCCcore-12.2.0/lib64 -L/slowdata/richardb/easybuild/software/UCX/1.13.1-GCCcore-12.2.0/lib -L/slowdata/richardb/easybuild/software/libevent/2.1.12-GCCcore-12.2.0/lib64 -L/slowdata/richardb/easybuild/software/libevent/2.1.12-GCCcore-12.2.0/lib -L/slowdata/richardb/easybuild/software/hwloc/2.8.0-GCCcore-12.2.0/lib64 -L/slowdata/richardb/easybuild/software/hwloc/2.8.0-GCCcore-12.2.0/lib -L/slowdata/richardb/easybuild/software/zlib/1.2.12-GCCcore-12.2.0/lib64 -L/slowdata/richardb/easybuild/software/zlib/1.2.12-GCCcore-12.2.0/lib -L/slowdata/richardb/easybuild/software/Perl/5.36.0-GCCcore-12.2.0/lib64 
-L/slowdata/richardb/easybuild/software/Perl/5.36.0-GCCcore-12.2.0/lib -L/slowdata/richardb/easybuild/software/pkgconf/1.9.3-GCCcore-12.2.0/lib64 -L/slowdata/richardb/easybuild/software/pkgconf/1.9.3-GCCcore-12.2.0/lib -L/slowdata/richardb/easybuild/software/GCCcore/12.2.0/lib64 -L/slowdata/richardb/easybuild/software/GCCcore/12.2.0/lib' 'LIBS=-lm -lpthread' 'CPPFLAGS=-I/slowdata/richardb/easybuild/software/UCC/1.1.0-GCCcore-12.2.0/include -I/slowdata/richardb/easybuild/software/PMIx/4.2.2-GCCcore-12.2.0/include -I/slowdata/richardb/easybuild/software/libfabric/1.16.1-GCCcore-12.2.0/include -I/slowdata/richardb/easybuild/software/UCX/1.13.1-GCCcore-12.2.0/include -I/slowdata/richardb/easybuild/software/libevent/2.1.12-GCCcore-12.2.0/include -I/slowdata/richardb/easybuild/software/hwloc/2.8.0-GCCcore-12.2.0/include -I/slowdata/richardb/easybuild/software/zlib/1.2.12-GCCcore-12.2.0/include -I/slowdata/richardb/easybuild/software/pkgconf/1.9.3-GCCcore-12.2.0/include' 'CXX=g++' 'CXXFLAGS=-O2 -ftree-vectorize -march=native -fno-math-errno' 'FC=gfortran' 'FCFLAGS=-O2 -ftree-vectorize -march=native -fno-math-errno' 
'PKG_CONFIG_PATH=/slowdata/richardb/easybuild/software/PMIx/4.2.2-GCCcore-12.2.0/lib/pkgconfig:/slowdata/richardb/easybuild/software/libfabric/1.16.1-GCCcore-12.2.0/lib/pkgconfig:/slowdata/richardb/easybuild/software/UCX/1.13.1-GCCcore-12.2.0/lib/pkgconfig:/slowdata/richardb/easybuild/software/libevent/2.1.12-GCCcore-12.2.0/lib/pkgconfig:/slowdata/richardb/easybuild/software/hwloc/2.8.0-GCCcore-12.2.0/lib/pkgconfig:/slowdata/richardb/easybuild/software/libpciaccess/0.17-GCCcore-12.2.0/lib/pkgconfig:/slowdata/richardb/easybuild/software/libxml2/2.10.3-GCCcore-12.2.0/lib/pkgconfig:/slowdata/richardb/easybuild/software/XZ/5.2.7-GCCcore-12.2.0/lib/pkgconfig:/slowdata/richardb/easybuild/software/numactl/2.0.16-GCCcore-12.2.0/lib/pkgconfig:/slowdata/richardb/easybuild/software/zlib/1.2.12-GCCcore-12.2.0/lib/pkgconfig:/slowdata/richardb/easybuild/software/OpenSSL/1.1/lib/pkgconfig:/slowdata/richardb/easybuild/software/libreadline/8.2-GCCcore-12.2.0/lib/pkgconfig:/slowdata/richardb/easybuild/software/expat/2.4.9-GCCcore-12.2.0/lib/pkgconfig:/slowdata/richardb/easybuild/software/pkgconf/1.9.3-GCCcore-12.2.0/lib/pkgconfig' '--no-create' '--no-recursion'
                Built by: richardb
                Built on: Sat 30 Mar 2024 11:25:55 UTC
              Built host: duvel
              C bindings: yes
            C++ bindings: no
             Fort mpif.h: yes (all)
            Fort use mpi: yes (full: ignore TKR)
       Fort use mpi size: deprecated-ompi-info-value
        Fort use mpi_f08: yes
 Fort mpi_f08 compliance: The mpi_f08 module is available, but due to limitations in the gfortran compiler and/or Open MPI, does not support the following: array subsections, direct passthru (where possible) to underlying Open MPI's C functionality
  Fort mpi_f08 subarrays: no
           Java bindings: no
  Wrapper compiler rpath: runpath
              C compiler: gcc
     C compiler absolute: /slowdata/richardb/easybuild/software/GCCcore/12.2.0/bin/gcc
  C compiler family name: GNU
      C compiler version: 12.2.0
            C++ compiler: g++
   C++ compiler absolute: /slowdata/richardb/easybuild/software/GCCcore/12.2.0/bin/g++
           Fort compiler: gfortran
       Fort compiler abs: /slowdata/richardb/easybuild/software/GCCcore/12.2.0/bin/gfortran
         Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
   Fort 08 assumed shape: yes
      Fort optional args: yes
          Fort INTERFACE: yes
    Fort ISO_FORTRAN_ENV: yes
       Fort STORAGE_SIZE: yes
      Fort BIND(C) (all): yes
      Fort ISO_C_BINDING: yes
 Fort SUBROUTINE BIND(C): yes
       Fort TYPE,BIND(C): yes
 Fort T,BIND(C,name="a"): yes
            Fort PRIVATE: yes
          Fort PROTECTED: yes
           Fort ABSTRACT: yes
       Fort ASYNCHRONOUS: yes
          Fort PROCEDURE: yes
         Fort USE...ONLY: yes
           Fort C_FUNLOC: yes
 Fort f08 using wrappers: yes
         Fort MPI_SIZEOF: yes
             C profiling: yes
           C++ profiling: no
   Fort mpif.h profiling: yes
  Fort use mpi profiling: yes
   Fort use mpi_f08 prof: yes
          C++ exceptions: no
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)
           Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: yes
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
              dl support: yes
   Heterogeneous support: no
 mpirun default --prefix: yes
       MPI_WTIME support: native
     Symbol vis. support: yes
   Host topology support: yes
            IPv6 support: no
      MPI1 compatibility: no
          MPI extensions: affinity, cuda, pcollreq
   FT Checkpoint support: no (checkpoint thread: no)
   C/R Enabled Debugging: no
  MPI_MAX_PROCESSOR_NAME: 256
    MPI_MAX_ERROR_STRING: 256
     MPI_MAX_OBJECT_NAME: 64
        MPI_MAX_INFO_KEY: 36
        MPI_MAX_INFO_VAL: 256
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
           MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v4.1.4)
           MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v4.1.4)
           MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA btl: smcuda (MCA v2.1.0, API v3.1.0, Component v4.1.4)
                 MCA btl: ofi (MCA v2.1.0, API v3.1.0, Component v4.1.4)
                 MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.1.4)
                 MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.1.4)
                 MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.1.4)
            MCA compress: gzip (MCA v2.1.0, API v2.0.0, Component v4.1.4)
            MCA compress: bzip (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA crs: none (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                  MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v4.1.4)
               MCA event: external (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA hwloc: external (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                  MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                  MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component v4.1.4)
         MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v4.1.4)
         MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v4.1.4)
              MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA mpool: hugepage (MCA v2.1.0, API v3.0.0, Component v4.1.4)
             MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component v4.1.4)
                MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA pmix: ext3x (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA pstat: linux (MCA v2.1.0, API v2.0.0, Component v4.1.4)
              MCA rcache: rgpusm (MCA v2.1.0, API v3.3.0, Component v4.1.4)
              MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v4.1.4)
              MCA rcache: gpusm (MCA v2.1.0, API v3.3.0, Component v4.1.4)
           MCA reachable: netlink (MCA v2.1.0, API v2.0.0, Component v4.1.4)
           MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v4.1.4)
              MCA errmgr: default_hnp (MCA v2.1.0, API v3.0.0, Component v4.1.4)
              MCA errmgr: default_app (MCA v2.1.0, API v3.0.0, Component v4.1.4)
              MCA errmgr: default_tool (MCA v2.1.0, API v3.0.0, Component v4.1.4)
              MCA errmgr: default_orted (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA ess: hnp (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA ess: singleton (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA ess: tool (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA ess: env (MCA v2.1.0, API v3.0.0, Component v4.1.4)
               MCA filem: raw (MCA v2.1.0, API v2.0.0, Component v4.1.4)
             MCA grpcomm: direct (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA iof: orted (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA iof: tool (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA iof: hnp (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA odls: default (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA odls: pspawn (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA oob: tcp (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA plm: isolated (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA plm: rsh (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA regx: reverse (MCA v2.1.0, API v1.0.0, Component v4.1.4)
                MCA regx: fwd (MCA v2.1.0, API v1.0.0, Component v4.1.4)
                MCA regx: naive (MCA v2.1.0, API v1.0.0, Component v4.1.4)
               MCA rmaps: ppr (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA rmaps: round_robin (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA rmaps: seq (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA rmaps: resilient (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA rmaps: rank_file (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA rmaps: mindist (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA rml: oob (MCA v2.1.0, API v3.0.0, Component v4.1.4)
              MCA routed: direct (MCA v2.1.0, API v3.0.0, Component v4.1.4)
              MCA routed: binomial (MCA v2.1.0, API v3.0.0, Component v4.1.4)
              MCA routed: radix (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA rtc: hwloc (MCA v2.1.0, API v1.0.0, Component v4.1.4)
              MCA schizo: ompi (MCA v2.1.0, API v1.0.0, Component v4.1.4)
              MCA schizo: orte (MCA v2.1.0, API v1.0.0, Component v4.1.4)
              MCA schizo: jsm (MCA v2.1.0, API v1.0.0, Component v4.1.4)
              MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.1.4)
              MCA schizo: flux (MCA v2.1.0, API v1.0.0, Component v4.1.4)
               MCA state: hnp (MCA v2.1.0, API v1.0.0, Component v4.1.4)
               MCA state: app (MCA v2.1.0, API v1.0.0, Component v4.1.4)
               MCA state: novm (MCA v2.1.0, API v1.0.0, Component v4.1.4)
               MCA state: tool (MCA v2.1.0, API v1.0.0, Component v4.1.4)
               MCA state: orted (MCA v2.1.0, API v1.0.0, Component v4.1.4)
                 MCA bml: r2 (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: monitoring (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: self (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: sm (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: libnbc (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: basic (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: tuned (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: adapt (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: inter (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: sync (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: cuda (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: han (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: ucc (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA fcoll: two_phase (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                  MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                  MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                  MCA io: romio321 (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA mtl: ofi (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                  MCA op: avx (MCA v2.1.0, API v1.0.0, Component v4.1.4)
                 MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA osc: pt2pt (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA pml: v (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA pml: monitoring (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA pml: cm (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA pml: ob1 (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA rte: orte (MCA v2.1.0, API v2.0.0, Component v4.1.4)
            MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v4.1.4)
            MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component v4.1.4)
            MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component v4.1.4)
                MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v4.1.4)
           MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component v4.1.4)

Cloud VM

                 Package: Open MPI ubuntu@easybuild Distribution
                Open MPI: 4.1.4
  Open MPI repo revision: v4.1.4
   Open MPI release date: May 26, 2022
                Open RTE: 4.1.4
  Open RTE repo revision: v4.1.4
   Open RTE release date: May 26, 2022
                    OPAL: 4.1.4
      OPAL repo revision: v4.1.4
       OPAL release date: May 26, 2022
                 MPI API: 3.1.0
            Ident string: 4.1.4
                  Prefix: /home/ubuntu/.local/easybuild/software/OpenMPI/4.1.4-GCC-12.2.0
 Configured architecture: x86_64-pc-linux-gnu
          Configure host: easybuild
           Configured by: ubuntu
           Configured on: Sat Mar 30 15:29:09 UTC 2024
          Configure host: easybuild
  Configure command line: '--prefix=/home/ubuntu/.local/easybuild/software/OpenMPI/4.1.4-GCC-12.2.0' '--build=x86_64-pc-linux-gnu' '--host=x86_64-pc-linux-gnu' '--with-cuda=internal' '--enable-mpirun-prefix-by-default' '--enable-shared' '--with-hwloc=/home/ubuntu/.local/easybuild/software/hwloc/2.8.0-GCCcore-12.2.0' '--with-libevent=/home/ubuntu/.local/easybuild/software/libevent/2.1.12-GCCcore-12.2.0' '--with-ofi=/home/ubuntu/.local/easybuild/software/libfabric/1.16.1-GCCcore-12.2.0' '--with-pmix=/home/ubuntu/.local/easybuild/software/PMIx/4.2.2-GCCcore-12.2.0' '--with-ucx=/home/ubuntu/.local/easybuild/software/UCX/1.13.1-GCCcore-12.2.0' '--with-ucc=/home/ubuntu/.local/easybuild/software/UCC/1.1.0-GCCcore-12.2.0' '--without-verbs' 'build_alias=x86_64-pc-linux-gnu' 'host_alias=x86_64-pc-linux-gnu' 'CC=gcc' 'CFLAGS=-O2 -ftree-vectorize -march=native -fno-math-errno' 'LDFLAGS=-L/home/ubuntu/.local/easybuild/software/UCC/1.1.0-GCCcore-12.2.0/lib64 -L/home/ubuntu/.local/easybuild/software/UCC/1.1.0-GCCcore-12.2.0/lib -L/home/ubuntu/.local/easybuild/software/PMIx/4.2.2-GCCcore-12.2.0/lib64 -L/home/ubuntu/.local/easybuild/software/PMIx/4.2.2-GCCcore-12.2.0/lib -L/home/ubuntu/.local/easybuild/software/libfabric/1.16.1-GCCcore-12.2.0/lib64 -L/home/ubuntu/.local/easybuild/software/libfabric/1.16.1-GCCcore-12.2.0/lib -L/home/ubuntu/.local/easybuild/software/UCX/1.13.1-GCCcore-12.2.0/lib64 -L/home/ubuntu/.local/easybuild/software/UCX/1.13.1-GCCcore-12.2.0/lib -L/home/ubuntu/.local/easybuild/software/libevent/2.1.12-GCCcore-12.2.0/lib64 -L/home/ubuntu/.local/easybuild/software/libevent/2.1.12-GCCcore-12.2.0/lib -L/home/ubuntu/.local/easybuild/software/hwloc/2.8.0-GCCcore-12.2.0/lib64 -L/home/ubuntu/.local/easybuild/software/hwloc/2.8.0-GCCcore-12.2.0/lib -L/home/ubuntu/.local/easybuild/software/zlib/1.2.12-GCCcore-12.2.0/lib64 -L/home/ubuntu/.local/easybuild/software/zlib/1.2.12-GCCcore-12.2.0/lib -L/home/ubuntu/.local/easybuild/software/Perl/5.36.0-GCCcore-12.2.0/lib64 
-L/home/ubuntu/.local/easybuild/software/Perl/5.36.0-GCCcore-12.2.0/lib -L/home/ubuntu/.local/easybuild/software/pkgconf/1.9.3-GCCcore-12.2.0/lib64 -L/home/ubuntu/.local/easybuild/software/pkgconf/1.9.3-GCCcore-12.2.0/lib -L/home/ubuntu/.local/easybuild/software/GCCcore/12.2.0/lib64 -L/home/ubuntu/.local/easybuild/software/GCCcore/12.2.0/lib' 'LIBS=-lm -lpthread' 'CPPFLAGS=-I/home/ubuntu/.local/easybuild/software/UCC/1.1.0-GCCcore-12.2.0/include -I/home/ubuntu/.local/easybuild/software/PMIx/4.2.2-GCCcore-12.2.0/include -I/home/ubuntu/.local/easybuild/software/libfabric/1.16.1-GCCcore-12.2.0/include -I/home/ubuntu/.local/easybuild/software/UCX/1.13.1-GCCcore-12.2.0/include -I/home/ubuntu/.local/easybuild/software/libevent/2.1.12-GCCcore-12.2.0/include -I/home/ubuntu/.local/easybuild/software/hwloc/2.8.0-GCCcore-12.2.0/include -I/home/ubuntu/.local/easybuild/software/zlib/1.2.12-GCCcore-12.2.0/include -I/home/ubuntu/.local/easybuild/software/pkgconf/1.9.3-GCCcore-12.2.0/include' 'CXX=g++' 'CXXFLAGS=-O2 -ftree-vectorize -march=native -fno-math-errno' 'FC=gfortran' 'FCFLAGS=-O2 -ftree-vectorize -march=native -fno-math-errno' 
'PKG_CONFIG_PATH=/home/ubuntu/.local/easybuild/software/PMIx/4.2.2-GCCcore-12.2.0/lib/pkgconfig:/home/ubuntu/.local/easybuild/software/libfabric/1.16.1-GCCcore-12.2.0/lib/pkgconfig:/home/ubuntu/.local/easybuild/software/UCX/1.13.1-GCCcore-12.2.0/lib/pkgconfig:/home/ubuntu/.local/easybuild/software/libevent/2.1.12-GCCcore-12.2.0/lib/pkgconfig:/home/ubuntu/.local/easybuild/software/hwloc/2.8.0-GCCcore-12.2.0/lib/pkgconfig:/home/ubuntu/.local/easybuild/software/libpciaccess/0.17-GCCcore-12.2.0/lib/pkgconfig:/home/ubuntu/.local/easybuild/software/libxml2/2.10.3-GCCcore-12.2.0/lib/pkgconfig:/home/ubuntu/.local/easybuild/software/XZ/5.2.7-GCCcore-12.2.0/lib/pkgconfig:/home/ubuntu/.local/easybuild/software/numactl/2.0.16-GCCcore-12.2.0/lib/pkgconfig:/home/ubuntu/.local/easybuild/software/zlib/1.2.12-GCCcore-12.2.0/lib/pkgconfig:/home/ubuntu/.local/easybuild/software/OpenSSL/1.1/lib/pkgconfig:/home/ubuntu/.local/easybuild/software/libreadline/8.2-GCCcore-12.2.0/lib/pkgconfig:/home/ubuntu/.local/easybuild/software/expat/2.4.9-GCCcore-12.2.0/lib/pkgconfig:/home/ubuntu/.local/easybuild/software/pkgconf/1.9.3-GCCcore-12.2.0/lib/pkgconfig' '--no-create' '--no-recursion'
                Built by: ubuntu
                Built on: Sat 30 Mar 2024 15:35:46 UTC
              Built host: easybuild
              C bindings: yes
            C++ bindings: no
             Fort mpif.h: yes (all)
            Fort use mpi: yes (full: ignore TKR)
       Fort use mpi size: deprecated-ompi-info-value
        Fort use mpi_f08: yes
 Fort mpi_f08 compliance: The mpi_f08 module is available, but due to limitations in the gfortran compiler and/or Open MPI, does not support the following: array subsections, direct passthru (where possible) to underlying Open MPI's C functionality
  Fort mpi_f08 subarrays: no
           Java bindings: no
  Wrapper compiler rpath: runpath
              C compiler: gcc
     C compiler absolute: /home/ubuntu/.local/easybuild/software/GCCcore/12.2.0/bin/gcc
  C compiler family name: GNU
      C compiler version: 12.2.0
            C++ compiler: g++
   C++ compiler absolute: /home/ubuntu/.local/easybuild/software/GCCcore/12.2.0/bin/g++
           Fort compiler: gfortran
       Fort compiler abs: /home/ubuntu/.local/easybuild/software/GCCcore/12.2.0/bin/gfortran
         Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
   Fort 08 assumed shape: yes
      Fort optional args: yes
          Fort INTERFACE: yes
    Fort ISO_FORTRAN_ENV: yes
       Fort STORAGE_SIZE: yes
      Fort BIND(C) (all): yes
      Fort ISO_C_BINDING: yes
 Fort SUBROUTINE BIND(C): yes
       Fort TYPE,BIND(C): yes
 Fort T,BIND(C,name="a"): yes
            Fort PRIVATE: yes
          Fort PROTECTED: yes
           Fort ABSTRACT: yes
       Fort ASYNCHRONOUS: yes
          Fort PROCEDURE: yes
         Fort USE...ONLY: yes
           Fort C_FUNLOC: yes
 Fort f08 using wrappers: yes
         Fort MPI_SIZEOF: yes
             C profiling: yes
           C++ profiling: no
   Fort mpif.h profiling: yes
  Fort use mpi profiling: yes
   Fort use mpi_f08 prof: yes
          C++ exceptions: no
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)
           Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: yes
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
              dl support: yes
   Heterogeneous support: no
 mpirun default --prefix: yes
       MPI_WTIME support: native
     Symbol vis. support: yes
   Host topology support: yes
            IPv6 support: no
      MPI1 compatibility: no
          MPI extensions: affinity, cuda, pcollreq
   FT Checkpoint support: no (checkpoint thread: no)
   C/R Enabled Debugging: no
  MPI_MAX_PROCESSOR_NAME: 256
    MPI_MAX_ERROR_STRING: 256
     MPI_MAX_OBJECT_NAME: 64
        MPI_MAX_INFO_KEY: 36
        MPI_MAX_INFO_VAL: 256
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
           MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v4.1.4)
           MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v4.1.4)
           MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.1.4)
                 MCA btl: ofi (MCA v2.1.0, API v3.1.0, Component v4.1.4)
                 MCA btl: smcuda (MCA v2.1.0, API v3.1.0, Component v4.1.4)
                 MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.1.4)
                 MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.1.4)
            MCA compress: gzip (MCA v2.1.0, API v2.0.0, Component v4.1.4)
            MCA compress: bzip (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA crs: none (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                  MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v4.1.4)
               MCA event: external (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA hwloc: external (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                  MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                  MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component v4.1.4)
         MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v4.1.4)
         MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v4.1.4)
              MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA mpool: hugepage (MCA v2.1.0, API v3.0.0, Component v4.1.4)
             MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component v4.1.4)
                MCA pmix: flux (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA pmix: isolated (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA pmix: ext3x (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA pstat: linux (MCA v2.1.0, API v2.0.0, Component v4.1.4)
              MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v4.1.4)
              MCA rcache: gpusm (MCA v2.1.0, API v3.3.0, Component v4.1.4)
              MCA rcache: rgpusm (MCA v2.1.0, API v3.3.0, Component v4.1.4)
           MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v4.1.4)
           MCA reachable: netlink (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v4.1.4)
              MCA errmgr: default_tool (MCA v2.1.0, API v3.0.0, Component v4.1.4)
              MCA errmgr: default_orted (MCA v2.1.0, API v3.0.0, Component v4.1.4)
              MCA errmgr: default_app (MCA v2.1.0, API v3.0.0, Component v4.1.4)
              MCA errmgr: default_hnp (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA ess: env (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA ess: hnp (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA ess: singleton (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA ess: pmi (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA ess: slurm (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA ess: tool (MCA v2.1.0, API v3.0.0, Component v4.1.4)
               MCA filem: raw (MCA v2.1.0, API v2.0.0, Component v4.1.4)
             MCA grpcomm: direct (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA iof: orted (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA iof: tool (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA iof: hnp (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA odls: pspawn (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA odls: default (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA oob: tcp (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA plm: slurm (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA plm: rsh (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA plm: isolated (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA ras: simulator (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA ras: slurm (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA regx: fwd (MCA v2.1.0, API v1.0.0, Component v4.1.4)
                MCA regx: reverse (MCA v2.1.0, API v1.0.0, Component v4.1.4)
                MCA regx: naive (MCA v2.1.0, API v1.0.0, Component v4.1.4)
               MCA rmaps: resilient (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA rmaps: rank_file (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA rmaps: ppr (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA rmaps: seq (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA rmaps: round_robin (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA rmaps: mindist (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA rml: oob (MCA v2.1.0, API v3.0.0, Component v4.1.4)
              MCA routed: binomial (MCA v2.1.0, API v3.0.0, Component v4.1.4)
              MCA routed: radix (MCA v2.1.0, API v3.0.0, Component v4.1.4)
              MCA routed: direct (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA rtc: hwloc (MCA v2.1.0, API v1.0.0, Component v4.1.4)
              MCA schizo: jsm (MCA v2.1.0, API v1.0.0, Component v4.1.4)
              MCA schizo: orte (MCA v2.1.0, API v1.0.0, Component v4.1.4)
              MCA schizo: slurm (MCA v2.1.0, API v1.0.0, Component v4.1.4)
              MCA schizo: flux (MCA v2.1.0, API v1.0.0, Component v4.1.4)
              MCA schizo: ompi (MCA v2.1.0, API v1.0.0, Component v4.1.4)
               MCA state: tool (MCA v2.1.0, API v1.0.0, Component v4.1.4)
               MCA state: novm (MCA v2.1.0, API v1.0.0, Component v4.1.4)
               MCA state: orted (MCA v2.1.0, API v1.0.0, Component v4.1.4)
               MCA state: hnp (MCA v2.1.0, API v1.0.0, Component v4.1.4)
               MCA state: app (MCA v2.1.0, API v1.0.0, Component v4.1.4)
                 MCA bml: r2 (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: adapt (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: self (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: monitoring (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: libnbc (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: sync (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: cuda (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: sm (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: ucc (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: tuned (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: basic (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: inter (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA coll: han (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA fcoll: two_phase (MCA v2.1.0, API v2.0.0, Component v4.1.4)
               MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                  MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                  MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                  MCA io: romio321 (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA mtl: ofi (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                  MCA op: avx (MCA v2.1.0, API v1.0.0, Component v4.1.4)
                 MCA osc: monitoring (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA osc: pt2pt (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v4.1.4)
                 MCA pml: v (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA pml: monitoring (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA pml: ob1 (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA pml: cm (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                 MCA rte: orte (MCA v2.1.0, API v2.0.0, Component v4.1.4)
            MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component v4.1.4)
            MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v4.1.4)
            MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component v4.1.4)
                MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component v4.1.4)
                MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v4.1.4)
           MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component v4.1.4)

@hppritcha
Member

Sorry for the delay in responding. I see you actually do have both ucx and ofi libfabric installed on the systems. To make sure we aren't trying to debug one of these could you rerun with

stuff preceding mpirun --verbose -n 8 --mca pml ob1 --mca btl self,vader /slowdata/richardb/easybuild/build/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_ring_c

and see if the test runs on both systems?

@richardbeare
Author

On the laptop:

OMPI_MCA_rmaps_base_oversubscribe=1 /slowdata/richardb/easybuild/software/OpenMPI/4.1.4-GCC-12.2.0/bin/mpirun -n 8 --mca pml ob1 --mca btl self,vader /slowdata/richardb/easybuild/build/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_ring_c
Process 0 sending 10 to 1, tag 201 (8 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
Process 2 exiting
Process 3 exiting
Process 4 exiting
Process 5 exiting
Process 6 exiting
Process 7 exiting

On the VM:

OMPI_MCA_rmaps_base_oversubscribe=1 ~/.local/easybuild/software/OpenMPI/4.1.4-GCC-12.2.0/bin/mpirun -n 8 --mca pml ob1 --mca btl self,vader ~/.local/easybuild/build/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_ring_c 
Process 0 sending 10 to 1, tag 201 (8 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 2 exiting
Process 4 exiting
Process 5 exiting
Process 1 exiting
Process 3 exiting
Process 7 exiting
Process 6 exiting

@hppritcha
Member

Thanks. Now it would be useful to see if ucx transport works for both systems.
Could you try first

stuff preceding mpirun --verbose -n 8 --mca pml ucx  /slowdata/richardb/easybuild/build/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_ring_c

and then

stuff preceding mpirun --verbose -n 8 --mca pml cm /slowdata/richardb/easybuild/build/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_ring_c
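
The three transport checks requested in this thread (ob1 over shared memory, ucx, cm) could be scripted roughly as below. This is only a sketch: `MPIRUN`, `RING_TEST`, `DRY_RUN` and the `pml_cmd` helper are made-up names, not part of Open MPI, and the 60 second limit is an arbitrary choice. By default it only prints the commands.

```shell
# Sketch only: MPIRUN and RING_TEST are placeholders for your own install paths.
MPIRUN="${MPIRUN:-mpirun}"
RING_TEST="${RING_TEST:-./mpi_test_ring_c}"

# Print the mpirun command line for one PML component (hypothetical helper).
pml_cmd() {
    case "$1" in
        ob1) extra="--mca btl self,vader" ;;  # shared memory only
        *)   extra="" ;;
    esac
    echo "$MPIRUN --verbose -n 8 --mca pml $1 $extra $RING_TEST" | tr -s ' '
}

export OMPI_MCA_rmaps_base_oversubscribe=1
for pml in ob1 ucx cm; do
    echo "== pml $pml =="
    if [ "${DRY_RUN:-1}" = 1 ]; then
        pml_cmd "$pml"                  # default: only print the command
    else
        # timeout (GNU coreutils) reports a hang as exit status 124.
        # Word splitting is intentional; this assumes paths without spaces.
        timeout 60 $(pml_cmd "$pml")
        echo "exit status: $?"
    fi
done
```

Running with `DRY_RUN=0` executes each command in turn; an exit status of 124 would indicate the run was killed by the timeout rather than finishing on its own.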

@richardbeare
Author

Laptop:

OMPI_MCA_rmaps_base_oversubscribe=1 /slowdata/richardb/easybuild/software/OpenMPI/4.1.4-GCC-12.2.0/bin/mpirun --verbose -n 8  --mca pml ucx  /slowdata/richardb/easybuild/build/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_ring_c
Process 0 sending 10 to 1, tag 201 (8 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 5 exiting
Process 6 exiting
Process 7 exiting
Process 2 exiting
Process 3 exiting
Process 4 exiting
Process 1 exiting

Hangs with pml cm:

OMPI_MCA_rmaps_base_oversubscribe=1 /slowdata/richardb/easybuild/software/OpenMPI/4.1.4-GCC-12.2.0/bin/mpirun --verbose -n 8  --mca pml cm  /slowdata/richardb/easybuild/build/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_ring_c
Process 0 sending 10 to 1, tag 201 (8 processes in ring)
Process 0 sent to 1

Cloud VM:

OMPI_MCA_rmaps_base_oversubscribe=1 ~/.local/easybuild/software/OpenMPI/4.1.4-GCC-12.2.0/bin/mpirun --verbose -n 8  --mca pml ucx  ~/.local/easybuild/build/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_ring_c
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host:      easybuild
  Framework: pml
--------------------------------------------------------------------------
[easybuild:210384] PML ucx cannot be selected
[easybuild:210373] PML ucx cannot be selected
[easybuild:210369] 2 more processes have sent help message help-mca-base.txt / find-available:none found
[easybuild:210369] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

OMPI_MCA_rmaps_base_oversubscribe=1 ~/.local/easybuild/software/OpenMPI/4.1.4-GCC-12.2.0/bin/mpirun --verbose -n 8  --mca pml cm  ~/.local/easybuild/build/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_ring_c
Process 0 sending 10 to 1, tag 201 (8 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
Process 2 exiting
Process 3 exiting
Process 5 exiting
Process 4 exiting
Process 6 exiting
Process 7 exiting

@hppritcha
Member

Thanks! This is pointing to an issue using OFI libfabric on your laptop. I'm not sure this is worth pursuing further, as we generally do not recommend using OFI libfabric on a single node. Nevertheless, to look further into this, can you report which release of libfabric easybuild installed on your laptop? I'm assuming that's how OFI libfabric got installed on the laptop.

@richardbeare
Author

Thanks,
There was also a system version of libfabric, installed as a dependency of a system vtk, I think. I had hoped that removing and rebuilding would solve the issue, but no luck.

However, I may have some clues.

The libfabric version installed by easybuild is the same on both the laptop and the cloud VM:

easybuild/software/libfabric/1.16.1-GCCcore-12.2.0/lib/libfabric.so
easybuild/software/libfabric/1.16.1-GCCcore-12.2.0/lib/libfabric.so.1.19.1

On the VM:

cat .local/easybuild/software/libfabric/1.16.1-GCCcore-12.2.0/lib/pkgconfig/libfabric.pc
prefix=/home/ubuntu/.local/easybuild/software/libfabric/1.16.1-GCCcore-12.2.0
exec_prefix=${prefix}
libdir=${exec_prefix}/lib
includedir=${prefix}/include

Name: libfabric
Description: OFI-WG libfabric
URL: https://github.com/ofiwg/libfabric.git
Version: 1.16.1
Requires:
Cflags: -I${includedir}
Libs: -L${libdir} -lfabric 
Libs.private:  -lrt -lnuma -libverbs -luuid -lefa -latomic -lpthread -ldl -lm
Requires.private:

On the laptop:

cat /slowdata/richardb/easybuild/software/libfabric/1.16.1-GCCcore-12.2.0/lib/pkgconfig/libfabric.pc 
prefix=/slowdata/richardb/easybuild/software/libfabric/1.16.1-GCCcore-12.2.0
exec_prefix=${prefix}
libdir=${exec_prefix}/lib
includedir=${prefix}/include

Name: libfabric
Description: OFI-WG libfabric
URL: https://github.com/ofiwg/libfabric.git
Version: 1.16.1
Requires:
Cflags: -I${includedir}
Libs: -L${libdir} -lfabric 
Libs.private:  -lrt -lnuma -libverbs -luuid -lefa -latomic -lpthread -ldl -lm -lcudart -lcuda
Requires.private:

So the laptop version is linking cuda stuff, which may explain the problems...

I'll see if I can figure out how to turn it off.
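
For comparing the two pkg-config files without reading them by eye, something like the following might help. It is a rough sketch: `private_libs` and `diff_private_libs` are hypothetical helper names, and it only looks at the `Libs.private` line of each file.

```shell
# Hypothetical helper: print the entries on the Libs.private line of a .pc
# file, one per line, sorted.
private_libs() {
    sed -n 's/^Libs\.private:[[:space:]]*//p' "$1" | tr -s ' ' '\n' | sed '/^$/d' | sort
}

# Print entries that appear in only one of the two files.
# Column 1: only in the first file. Column 2 (tab-indented): only in the second.
diff_private_libs() {
    a=$(mktemp)
    b=$(mktemp)
    private_libs "$1" > "$a"
    private_libs "$2" > "$b"
    comm -3 "$a" "$b"
    rm -f "$a" "$b"
}
```

Called as `diff_private_libs vm/libfabric.pc laptop/libfabric.pc` (paths are placeholders), the two files shown above would be expected to differ only in the `-lcudart` and `-lcuda` entries.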

@richardbeare
Author

richardbeare commented Apr 10, 2024

No luck, I'm afraid. I added --with-cuda=no to the libfabric configuration, and cuda no longer shows up as a dependency in pkgconfig. That brings us back to the question of why cm is selected as the default on this laptop.

I now have another potential cause of the problem - the laptop has docker installed, which leads to a bunch of network interfaces being available. I notice that the fi_info command is reporting entries like:

provider: tcp;ofi_rxm
    fabric: 172.17.0.0/16
    domain: docker0
    version: 116.10
    type: FI_EP_RDM
    protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm
    fabric: 192.168.122.0/24
    domain: virbr0
    version: 116.10
    type: FI_EP_RDM
    protocol: FI_PROTO_RXM
provider: tcp;ofi_rxm
    fabric: 192.168.0.0/24
    domain: wlp59s0
    version: 116.10
    type: FI_EP_RDM

where the last one is the real wifi interface.
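
To see at a glance which interfaces libfabric is offering, the `fi_info` output can be condensed to one line per entry. A minimal sketch, assuming the field order shown above (provider, then fabric, then domain); `fi_summary` is a made-up helper name:

```shell
# Condense fi_info output to "provider domain fabric", one line per entry.
# Assumes each entry lists provider:, fabric:, domain: in that order.
fi_summary() {
    awk '
        $1 == "provider:" { prov = $2 }
        $1 == "fabric:"   { fabric = $2 }
        $1 == "domain:"   { print prov, $2, fabric }
    '
}

# Usage (needs fi_info from libfabric on PATH):
#   fi_info | fi_summary
```

On this laptop that would list docker0, virbr0 and wlp59s0 next to each other, which makes the virtual bridges easier to spot.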

Finally, the ring test completes if I turn off wifi and plug in an ethernet cable, although there are warnings:

 OMPI_MCA_rmaps_base_oversubscribe=1 /slowdata/richardb/easybuild/software/OpenMPI/4.1.4-GCC-12.2.0/bin/mpirun --verbose -n 8  /slowdata/richardb/easybuild/build/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_ring_c 
duvel:rank5.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank5: PSM3 can't open nic unit: 1 (err=23)
duvel:rank5: PSM3 can't open nic unit: 1 (err=23)
duvel:rank5.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank5: PSM3 can't open nic unit: 1 (err=23)
duvel:rank5.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank5.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank5: PSM3 can't open nic unit: 1 (err=23)
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: duvel
  Location: mtl_ofi_component.c:512
  Error: Invalid argument (22)
--------------------------------------------------------------------------
duvel:rank1.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank1: PSM3 can't open nic unit: 1 (err=23)
duvel:rank1: PSM3 can't open nic unit: 1 (err=23)
duvel:rank1.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank1: PSM3 can't open nic unit: 1 (err=23)
duvel:rank1.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank0.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank0: PSM3 can't open nic unit: 1 (err=23)
duvel:rank0.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank0: PSM3 can't open nic unit: 1 (err=23)
duvel:rank0: PSM3 can't open nic unit: 1 (err=23)
duvel:rank0.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank3: PSM3 can't open nic unit: 1 (err=23)
duvel:rank3.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank3.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank3: PSM3 can't open nic unit: 1 (err=23)
duvel:rank3.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank3: PSM3 can't open nic unit: 1 (err=23)
duvel:rank6.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank6: PSM3 can't open nic unit: 1 (err=23)
duvel:rank6: PSM3 can't open nic unit: 1 (err=23)
duvel:rank6.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank6.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank6: PSM3 can't open nic unit: 1 (err=23)
duvel:rank7.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank7: PSM3 can't open nic unit: 1 (err=23)
duvel:rank7.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank7: PSM3 can't open nic unit: 1 (err=23)
duvel:rank7.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank7: PSM3 can't open nic unit: 1 (err=23)
duvel:rank2.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank2: PSM3 can't open nic unit: 1 (err=23)
duvel:rank2: PSM3 can't open nic unit: 1 (err=23)
duvel:rank2.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank2.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank2: PSM3 can't open nic unit: 1 (err=23)
duvel:rank4.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank4: PSM3 can't open nic unit: 1 (err=23)
duvel:rank4: PSM3 can't open nic unit: 1 (err=23)
duvel:rank4.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank4: PSM3 can't open nic unit: 1 (err=23)
duvel:rank4.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank1: PSM3 can't open nic unit: 1 (err=23)
duvel:rank1.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank0: PSM3 can't open nic unit: 1 (err=23)
duvel:rank0.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank3: PSM3 can't open nic unit: 1 (err=23)
duvel:rank3.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank6.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank6: PSM3 can't open nic unit: 1 (err=23)
duvel:rank7: PSM3 can't open nic unit: 1 (err=23)
duvel:rank7.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank2.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank2: PSM3 can't open nic unit: 1 (err=23)
duvel:rank4.mpi_test_ring_c: Failed to get enxc8f750d69d6b (unit 1) cpu set
duvel:rank4: PSM3 can't open nic unit: 1 (err=23)
Process 0 sending 10 to 1, tag 201 (8 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
Process 2 exiting
Process 4 exiting
Process 5 exiting
Process 3 exiting
Process 7 exiting
Process 6 exiting
[duvel:3056825] 7 more processes have sent help message help-mtl-ofi.txt / OFI call fail
[duvel:3056825] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

@bosilca
Member

bosilca commented Apr 10, 2024

Enable pml_base_verbose by setting it to 50 and you should get your answer.
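
The verbose output is long and interleaved across ranks, so a small filter can help pull out just the priority and selection lines. This is a sketch that assumes the Open MPI 4.1.x message wording ("select: initializing pml component", "init returned priority", "selected ... best priority"); `pml_selection` is a hypothetical helper name.

```shell
# Reduce pml_base_verbose output to the candidate priorities and the winner,
# tracked per process tag (the leading "[host:pid]" field).
pml_selection() {
    awk '
        /select: initializing pml component/ { comp[$1] = $NF }
        /select: init returned priority/     { print $1, "candidate", comp[$1], $NF }
        / selected .* best priority/         { print $1, "selected", $3, $NF }
    '
}

# Usage (2>&1 in case the verbose messages go to stderr):
#   mpirun -n 8 --mca pml_base_verbose 50 ./mpi_test_ring_c 2>&1 | pml_selection
```

Each rank then shows one `candidate` line per PML that initialised successfully and one `selected` line for the component with the highest priority.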

@richardbeare
Author

On wifi:

OMPI_MCA_rmaps_base_oversubscribe=1 /slowdata/richardb/easybuild/software/OpenMPI/4.1.4-GCC-12.2.0/bin/mpirun --verbose -n 8 --mca pml_base_verbose 50 /slowdata/richardb/easybuild/build/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_ring_c 
[duvel:3794600] mca: base: components_register: registering framework pml components
[duvel:3794600] mca: base: components_register: found loaded component v
[duvel:3794600] mca: base: components_register: component v register function successful
[duvel:3794600] mca: base: components_register: found loaded component monitoring
[duvel:3794600] mca: base: components_register: component monitoring register function successful
[duvel:3794600] mca: base: components_register: found loaded component cm
[duvel:3794600] mca: base: components_register: component cm register function successful
[duvel:3794600] mca: base: components_register: found loaded component ucx
[duvel:3794600] mca: base: components_register: component ucx register function successful
[duvel:3794600] mca: base: components_register: found loaded component ob1
[duvel:3794600] mca: base: components_register: component ob1 register function successful
[duvel:3794600] mca: base: components_open: opening pml components
[duvel:3794600] mca: base: components_open: found loaded component v
[duvel:3794600] mca: base: components_open: component v open function successful
[duvel:3794600] mca: base: components_open: found loaded component monitoring
[duvel:3794600] mca: base: components_open: component monitoring open function successful
[duvel:3794600] mca: base: components_open: found loaded component cm
[duvel:3794600] mca: base: components_open: component cm open function successful
[duvel:3794600] mca: base: components_open: found loaded component ucx
[duvel:3794600] mca: base: components_open: component ucx open function successful
[duvel:3794600] mca: base: components_open: found loaded component ob1
[duvel:3794600] mca: base: components_open: component ob1 open function successful
[duvel:3794600] select: component v not in the include list
[duvel:3794600] select: component monitoring not in the include list
[duvel:3794600] select: initializing pml component cm
[duvel:3794600] select: init returned priority 25
[duvel:3794600] select: initializing pml component ucx
[duvel:3794600] select: init returned priority 19
[duvel:3794600] select: initializing pml component ob1
[duvel:3794600] select: init returned priority 20
[duvel:3794600] selected cm best priority 25
[duvel:3794600] select: component cm selected
[duvel:3794600] select: component ucx not selected / finalized
[duvel:3794600] select: component ob1 not selected / finalized
[duvel:3794600] mca: base: close: component v closed
[duvel:3794600] mca: base: close: unloading component v
[duvel:3794600] mca: base: close: component monitoring closed
[duvel:3794600] mca: base: close: unloading component monitoring
[duvel:3794600] mca: base: close: component ucx closed
[duvel:3794600] mca: base: close: unloading component ucx
[duvel:3794600] mca: base: close: component ob1 closed
[duvel:3794600] mca: base: close: unloading component ob1
[duvel:3794599] mca: base: components_register: registering framework pml components
[duvel:3794599] mca: base: components_register: found loaded component v
[duvel:3794599] mca: base: components_register: component v register function successful
[duvel:3794599] mca: base: components_register: found loaded component monitoring
[duvel:3794599] mca: base: components_register: component monitoring register function successful
[duvel:3794599] mca: base: components_register: found loaded component cm
[duvel:3794599] mca: base: components_register: component cm register function successful
[duvel:3794599] mca: base: components_register: found loaded component ucx
[duvel:3794599] mca: base: components_register: component ucx register function successful
[duvel:3794599] mca: base: components_register: found loaded component ob1
[duvel:3794599] mca: base: components_register: component ob1 register function successful
[duvel:3794599] mca: base: components_open: opening pml components
[duvel:3794599] mca: base: components_open: found loaded component v
[duvel:3794599] mca: base: components_open: component v open function successful
[duvel:3794599] mca: base: components_open: found loaded component monitoring
[duvel:3794599] mca: base: components_open: component monitoring open function successful
[duvel:3794599] mca: base: components_open: found loaded component cm
[duvel:3794599] mca: base: components_open: component cm open function successful
[duvel:3794599] mca: base: components_open: found loaded component ucx
[duvel:3794598] mca: base: components_register: registering framework pml components
[duvel:3794598] mca: base: components_register: found loaded component v
[duvel:3794598] mca: base: components_register: component v register function successful
[duvel:3794598] mca: base: components_register: found loaded component monitoring
[duvel:3794598] mca: base: components_register: component monitoring register function successful
[duvel:3794598] mca: base: components_register: found loaded component cm
[duvel:3794598] mca: base: components_register: component cm register function successful
[duvel:3794598] mca: base: components_register: found loaded component ucx
[duvel:3794598] mca: base: components_register: component ucx register function successful
[duvel:3794598] mca: base: components_register: found loaded component ob1
[duvel:3794598] mca: base: components_register: component ob1 register function successful
[duvel:3794598] mca: base: components_open: opening pml components
[duvel:3794598] mca: base: components_open: found loaded component v
[duvel:3794598] mca: base: components_open: component v open function successful
[duvel:3794598] mca: base: components_open: found loaded component monitoring
[duvel:3794598] mca: base: components_open: component monitoring open function successful
[duvel:3794598] mca: base: components_open: found loaded component cm
[duvel:3794598] mca: base: components_open: component cm open function successful
[duvel:3794598] mca: base: components_open: found loaded component ucx
[duvel:3794602] mca: base: components_register: registering framework pml components
[duvel:3794602] mca: base: components_register: found loaded component v
[duvel:3794602] mca: base: components_register: component v register function successful
[duvel:3794602] mca: base: components_register: found loaded component monitoring
[duvel:3794602] mca: base: components_register: component monitoring register function successful
[duvel:3794602] mca: base: components_register: found loaded component cm
[duvel:3794602] mca: base: components_register: component cm register function successful
[duvel:3794602] mca: base: components_register: found loaded component ucx
[duvel:3794602] mca: base: components_register: component ucx register function successful
[duvel:3794602] mca: base: components_register: found loaded component ob1
[duvel:3794597] mca: base: components_register: registering framework pml components
[duvel:3794597] mca: base: components_register: found loaded component v
[duvel:3794597] mca: base: components_register: component v register function successful
[duvel:3794597] mca: base: components_register: found loaded component monitoring
[duvel:3794602] mca: base: components_register: component ob1 register function successful
[duvel:3794602] mca: base: components_open: opening pml components
[duvel:3794602] mca: base: components_open: found loaded component v
[duvel:3794597] mca: base: components_register: component monitoring register function successful
[duvel:3794597] mca: base: components_register: found loaded component cm
[duvel:3794597] mca: base: components_register: component cm register function successful
[duvel:3794597] mca: base: components_register: found loaded component ucx
[duvel:3794597] mca: base: components_register: component ucx register function successful
[duvel:3794602] mca: base: components_open: component v open function successful
[duvel:3794602] mca: base: components_open: found loaded component monitoring
[duvel:3794602] mca: base: components_open: component monitoring open function successful
[duvel:3794602] mca: base: components_open: found loaded component cm
[duvel:3794597] mca: base: components_register: found loaded component ob1
[duvel:3794597] mca: base: components_register: component ob1 register function successful
[duvel:3794597] mca: base: components_open: opening pml components
[duvel:3794597] mca: base: components_open: found loaded component v
[duvel:3794602] mca: base: components_open: component cm open function successful
[duvel:3794602] mca: base: components_open: found loaded component ucx
[duvel:3794597] mca: base: components_open: component v open function successful
[duvel:3794597] mca: base: components_open: found loaded component monitoring
[duvel:3794597] mca: base: components_open: component monitoring open function successful
[duvel:3794597] mca: base: components_open: found loaded component cm
[duvel:3794597] mca: base: components_open: component cm open function successful
[duvel:3794597] mca: base: components_open: found loaded component ucx
[duvel:3794601] mca: base: components_register: registering framework pml components
[duvel:3794601] mca: base: components_register: found loaded component v
[duvel:3794601] mca: base: components_register: component v register function successful
[duvel:3794601] mca: base: components_register: found loaded component monitoring
[duvel:3794601] mca: base: components_register: component monitoring register function successful
[duvel:3794601] mca: base: components_register: found loaded component cm
[duvel:3794601] mca: base: components_register: component cm register function successful
[duvel:3794601] mca: base: components_register: found loaded component ucx
[duvel:3794601] mca: base: components_register: component ucx register function successful
[duvel:3794601] mca: base: components_register: found loaded component ob1
[duvel:3794601] mca: base: components_register: component ob1 register function successful
[duvel:3794601] mca: base: components_open: opening pml components
[duvel:3794601] mca: base: components_open: found loaded component v
[duvel:3794601] mca: base: components_open: component v open function successful
[duvel:3794601] mca: base: components_open: found loaded component monitoring
[duvel:3794601] mca: base: components_open: component monitoring open function successful
[duvel:3794601] mca: base: components_open: found loaded component cm
[duvel:3794601] mca: base: components_open: component cm open function successful
[duvel:3794601] mca: base: components_open: found loaded component ucx
[duvel:3794595] mca: base: components_register: registering framework pml components
[duvel:3794595] mca: base: components_register: found loaded component v
[duvel:3794595] mca: base: components_register: component v register function successful
[duvel:3794599] mca: base: components_open: component ucx open function successful
[duvel:3794599] mca: base: components_open: found loaded component ob1
[duvel:3794599] mca: base: components_open: component ob1 open function successful
[duvel:3794595] mca: base: components_register: found loaded component monitoring
[duvel:3794595] mca: base: components_register: component monitoring register function successful
[duvel:3794595] mca: base: components_register: found loaded component cm
[duvel:3794595] mca: base: components_register: component cm register function successful
[duvel:3794595] mca: base: components_register: found loaded component ucx
[duvel:3794595] mca: base: components_register: component ucx register function successful
[duvel:3794595] mca: base: components_register: found loaded component ob1
[duvel:3794595] mca: base: components_register: component ob1 register function successful
[duvel:3794595] mca: base: components_open: opening pml components
[duvel:3794595] mca: base: components_open: found loaded component v
[duvel:3794596] mca: base: components_register: registering framework pml components
[duvel:3794596] mca: base: components_register: found loaded component v
[duvel:3794596] mca: base: components_register: component v register function successful
[duvel:3794596] mca: base: components_register: found loaded component monitoring
[duvel:3794595] mca: base: components_open: component v open function successful
[duvel:3794595] mca: base: components_open: found loaded component monitoring
[duvel:3794595] mca: base: components_open: component monitoring open function successful
[duvel:3794595] mca: base: components_open: found loaded component cm
[duvel:3794596] mca: base: components_register: component monitoring register function successful
[duvel:3794596] mca: base: components_register: found loaded component cm
[duvel:3794596] mca: base: components_register: component cm register function successful
[duvel:3794596] mca: base: components_register: found loaded component ucx
[duvel:3794595] mca: base: components_open: component cm open function successful
[duvel:3794595] mca: base: components_open: found loaded component ucx
[duvel:3794596] mca: base: components_register: component ucx register function successful
[duvel:3794596] mca: base: components_register: found loaded component ob1
[duvel:3794596] mca: base: components_register: component ob1 register function successful
[duvel:3794596] mca: base: components_open: opening pml components
[duvel:3794596] mca: base: components_open: found loaded component v
[duvel:3794596] mca: base: components_open: component v open function successful
[duvel:3794596] mca: base: components_open: found loaded component monitoring
[duvel:3794596] mca: base: components_open: component monitoring open function successful
[duvel:3794596] mca: base: components_open: found loaded component cm
[duvel:3794596] mca: base: components_open: component cm open function successful
[duvel:3794596] mca: base: components_open: found loaded component ucx
[duvel:3794602] mca: base: components_open: component ucx open function successful
[duvel:3794602] mca: base: components_open: found loaded component ob1
[duvel:3794602] mca: base: components_open: component ob1 open function successful
[duvel:3794597] mca: base: components_open: component ucx open function successful
[duvel:3794597] mca: base: components_open: found loaded component ob1
[duvel:3794597] mca: base: components_open: component ob1 open function successful
[duvel:3794598] mca: base: components_open: component ucx open function successful
[duvel:3794598] mca: base: components_open: found loaded component ob1
[duvel:3794598] mca: base: components_open: component ob1 open function successful
[duvel:3794599] select: component v not in the include list
[duvel:3794599] select: component monitoring not in the include list
[duvel:3794599] select: initializing pml component cm
[duvel:3794597] select: component v not in the include list
[duvel:3794597] select: component monitoring not in the include list
[duvel:3794597] select: initializing pml component cm
[duvel:3794602] select: component v not in the include list
[duvel:3794602] select: component monitoring not in the include list
[duvel:3794602] select: initializing pml component cm
[duvel:3794599] select: init returned priority 25
[duvel:3794599] select: initializing pml component ucx
[duvel:3794601] mca: base: components_open: component ucx open function successful
[duvel:3794601] mca: base: components_open: found loaded component ob1
[duvel:3794601] mca: base: components_open: component ob1 open function successful
[duvel:3794597] select: init returned priority 25
[duvel:3794597] select: initializing pml component ucx
[duvel:3794602] select: init returned priority 25
[duvel:3794602] select: initializing pml component ucx
[duvel:3794598] select: component v not in the include list
[duvel:3794598] select: component monitoring not in the include list
[duvel:3794598] select: initializing pml component cm
[duvel:3794595] mca: base: components_open: component ucx open function successful
[duvel:3794595] mca: base: components_open: found loaded component ob1
[duvel:3794595] mca: base: components_open: component ob1 open function successful
[duvel:3794596] mca: base: components_open: component ucx open function successful
[duvel:3794596] mca: base: components_open: found loaded component ob1
[duvel:3794596] mca: base: components_open: component ob1 open function successful
[duvel:3794598] select: init returned priority 25
[duvel:3794598] select: initializing pml component ucx
[duvel:3794595] select: component v not in the include list
[duvel:3794595] select: component monitoring not in the include list
[duvel:3794595] select: initializing pml component cm
[duvel:3794601] select: component v not in the include list
[duvel:3794601] select: component monitoring not in the include list
[duvel:3794601] select: initializing pml component cm
[duvel:3794595] select: init returned priority 25
[duvel:3794595] select: initializing pml component ucx
[duvel:3794596] select: component v not in the include list
[duvel:3794596] select: component monitoring not in the include list
[duvel:3794596] select: initializing pml component cm
[duvel:3794596] select: init returned priority 25
[duvel:3794596] select: initializing pml component ucx
[duvel:3794601] select: init returned priority 25
[duvel:3794601] select: initializing pml component ucx
[duvel:3794599] select: init returned priority 19
[duvel:3794599] select: initializing pml component ob1
[duvel:3794599] select: init returned priority 20
[duvel:3794599] selected cm best priority 25
[duvel:3794599] select: component cm selected
[duvel:3794602] select: init returned priority 19
[duvel:3794602] select: initializing pml component ob1
[duvel:3794602] select: init returned priority 20
[duvel:3794602] selected cm best priority 25
[duvel:3794602] select: component cm selected
[duvel:3794597] select: init returned priority 19
[duvel:3794597] select: initializing pml component ob1
[duvel:3794597] select: init returned priority 20
[duvel:3794597] selected cm best priority 25
[duvel:3794597] select: component cm selected
[duvel:3794599] select: component ucx not selected / finalized
[duvel:3794599] select: component ob1 not selected / finalized
[duvel:3794599] mca: base: close: component v closed
[duvel:3794599] mca: base: close: unloading component v
[duvel:3794599] mca: base: close: component monitoring closed
[duvel:3794599] mca: base: close: unloading component monitoring
[duvel:3794599] mca: base: close: component ucx closed
[duvel:3794599] mca: base: close: unloading component ucx
[duvel:3794599] mca: base: close: component ob1 closed
[duvel:3794599] mca: base: close: unloading component ob1
[duvel:3794602] select: component ucx not selected / finalized
[duvel:3794602] select: component ob1 not selected / finalized
[duvel:3794602] mca: base: close: component v closed
[duvel:3794602] mca: base: close: unloading component v
[duvel:3794602] mca: base: close: component monitoring closed
[duvel:3794602] mca: base: close: unloading component monitoring
[duvel:3794597] select: component ucx not selected / finalized
[duvel:3794597] select: component ob1 not selected / finalized
[duvel:3794597] mca: base: close: component v closed
[duvel:3794597] mca: base: close: unloading component v
[duvel:3794597] mca: base: close: component monitoring closed
[duvel:3794597] mca: base: close: unloading component monitoring
[duvel:3794602] mca: base: close: component ucx closed
[duvel:3794602] mca: base: close: unloading component ucx
[duvel:3794602] mca: base: close: component ob1 closed
[duvel:3794602] mca: base: close: unloading component ob1
[duvel:3794597] mca: base: close: component ucx closed
[duvel:3794597] mca: base: close: unloading component ucx
[duvel:3794597] mca: base: close: component ob1 closed
[duvel:3794597] mca: base: close: unloading component ob1
[duvel:3794595] select: init returned priority 19
[duvel:3794595] select: initializing pml component ob1
[duvel:3794595] select: init returned priority 20
[duvel:3794595] selected cm best priority 25
[duvel:3794595] select: component cm selected
[duvel:3794598] select: init returned priority 19
[duvel:3794598] select: initializing pml component ob1
[duvel:3794598] select: init returned priority 20
[duvel:3794598] selected cm best priority 25
[duvel:3794598] select: component cm selected
[duvel:3794596] select: init returned priority 19
[duvel:3794596] select: initializing pml component ob1
[duvel:3794596] select: init returned priority 20
[duvel:3794596] selected cm best priority 25
[duvel:3794596] select: component cm selected
[duvel:3794595] select: component ucx not selected / finalized
[duvel:3794595] select: component ob1 not selected / finalized
[duvel:3794595] mca: base: close: component v closed
[duvel:3794595] mca: base: close: unloading component v
[duvel:3794595] mca: base: close: component monitoring closed
[duvel:3794595] mca: base: close: unloading component monitoring
[duvel:3794595] mca: base: close: component ucx closed
[duvel:3794595] mca: base: close: unloading component ucx
[duvel:3794595] mca: base: close: component ob1 closed
[duvel:3794595] mca: base: close: unloading component ob1
[duvel:3794596] select: component ucx not selected / finalized
[duvel:3794596] select: component ob1 not selected / finalized
[duvel:3794596] mca: base: close: component v closed
[duvel:3794596] mca: base: close: unloading component v
[duvel:3794596] mca: base: close: component monitoring closed
[duvel:3794596] mca: base: close: unloading component monitoring
[duvel:3794598] select: component ucx not selected / finalized
[duvel:3794598] select: component ob1 not selected / finalized
[duvel:3794598] mca: base: close: component v closed
[duvel:3794598] mca: base: close: unloading component v
[duvel:3794596] mca: base: close: component ucx closed
[duvel:3794596] mca: base: close: unloading component ucx
[duvel:3794598] mca: base: close: component monitoring closed
[duvel:3794598] mca: base: close: unloading component monitoring
[duvel:3794596] mca: base: close: component ob1 closed
[duvel:3794596] mca: base: close: unloading component ob1
[duvel:3794598] mca: base: close: component ucx closed
[duvel:3794598] mca: base: close: unloading component ucx
[duvel:3794598] mca: base: close: component ob1 closed
[duvel:3794598] mca: base: close: unloading component ob1
[duvel:3794601] select: init returned priority 19
[duvel:3794601] select: initializing pml component ob1
[duvel:3794601] select: init returned priority 20
[duvel:3794601] selected cm best priority 25
[duvel:3794601] select: component cm selected
[duvel:3794601] select: component ucx not selected / finalized
[duvel:3794601] select: component ob1 not selected / finalized
[duvel:3794601] mca: base: close: component v closed
[duvel:3794601] mca: base: close: unloading component v
[duvel:3794601] mca: base: close: component monitoring closed
[duvel:3794601] mca: base: close: unloading component monitoring
[duvel:3794601] mca: base: close: component ucx closed
[duvel:3794601] mca: base: close: unloading component ucx
[duvel:3794601] mca: base: close: component ob1 closed
[duvel:3794601] mca: base: close: unloading component ob1
[duvel:3794601] check:select: checking my pml cm against process [[61709,1],0] pml cm
[duvel:3794595] check:select: PML check not necessary on self
[duvel:3794600] check:select: checking my pml cm against process [[61709,1],0] pml cm
[duvel:3794597] check:select: checking my pml cm against process [[61709,1],0] pml cm
[duvel:3794602] check:select: checking my pml cm against process [[61709,1],0] pml cm
[duvel:3794599] check:select: checking my pml cm against process [[61709,1],0] pml cm
[duvel:3794596] check:select: checking my pml cm against process [[61709,1],0] pml cm
[duvel:3794598] check:select: checking my pml cm against process [[61709,1],0] pml cm
Process 0 sending 10 to 1, tag 201 (8 processes in ring)
Process 0 sent to 1
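
With eight ranks interleaving their output, the PML selection is hard to read off the raw log. In the hanging run above, each rank whose selection lines are shown picks `cm` at priority 25 (`ucx` returns 19, `ob1` returns 20). The following is a minimal sketch for pulling that out of a saved `pml_base_verbose` log. It is not part of Open MPI; the function name `summarize_pml` is hypothetical, and the regular expressions assume only the two message formats seen in these logs.

```python
import re

# Assumed format: "[duvel:3794599] selected cm best priority 25"
SELECTED = re.compile(
    r"^\[(?P<host>[^:\]]+):(?P<pid>\d+)\] selected (?P<pml>\S+) best priority (?P<prio>\d+)"
)
# Assumed format: "[duvel:3794239] select: init returned failure for component cm"
FAILED = re.compile(
    r"^\[(?P<host>[^:\]]+):(?P<pid>\d+)\] select: init returned failure for component (?P<pml>\S+)"
)


def summarize_pml(lines):
    """Return (selected, failed) dictionaries keyed by PID.

    selected[pid] is a (component, priority) tuple.
    failed[pid] is a list of components whose init reported failure.
    Lines that match neither pattern are ignored.
    """
    selected = {}
    failed = {}
    for line in lines:
        m = SELECTED.match(line)
        if m:
            selected[int(m["pid"])] = (m["pml"], int(m["prio"]))
            continue
        m = FAILED.match(line)
        if m:
            failed.setdefault(int(m["pid"]), []).append(m["pml"])
    return selected, failed


# Three lines copied from the logs in this issue, used as sample input.
SAMPLE = [
    "[duvel:3794599] selected cm best priority 25",
    "[duvel:3794239] select: init returned failure for component cm",
    "[duvel:3794239] selected ob1 best priority 20",
]

selected, failed = summarize_pml(SAMPLE)
for pid in sorted(selected):
    component, priority = selected[pid]
    note = ""
    if pid in failed:
        note = " (init failed: " + ", ".join(failed[pid]) + ")"
    print(f"pid {pid}: {component} at priority {priority}{note}")
```

To run it on a full log, pass an open file object instead of `SAMPLE`; the parser only needs an iterable of lines.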

Same test with the laptop on a wired network:

OMPI_MCA_rmaps_base_oversubscribe=1 /slowdata/richardb/easybuild/software/OpenMPI/4.1.4-GCC-12.2.0/bin/mpirun --verbose -n 8 --mca pml_base_verbose 50 /slowdata/richardb/easybuild/build/OpenMPI/4.1.4/GCC-12.2.0/mpi_test_ring_c 
duvel:rank2.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank2: PSM3 can't open nic unit: 1 (err=23)
duvel:rank2.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank2: PSM3 can't open nic unit: 1 (err=23)
duvel:rank2.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank2: PSM3 can't open nic unit: 1 (err=23)
duvel:rank5.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank5: PSM3 can't open nic unit: 1 (err=23)
duvel:rank5: PSM3 can't open nic unit: 1 (err=23)
duvel:rank5.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank7.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank7: PSM3 can't open nic unit: 1 (err=23)
duvel:rank3.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank3: PSM3 can't open nic unit: 1 (err=23)
duvel:rank5: PSM3 can't open nic unit: 1 (err=23)
duvel:rank5.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank7.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank7: PSM3 can't open nic unit: 1 (err=23)
duvel:rank3.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank3: PSM3 can't open nic unit: 1 (err=23)
duvel:rank7.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank7: PSM3 can't open nic unit: 1 (err=23)
duvel:rank3.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank3: PSM3 can't open nic unit: 1 (err=23)
duvel:rank0.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank0: PSM3 can't open nic unit: 1 (err=23)
duvel:rank1.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank1: PSM3 can't open nic unit: 1 (err=23)
duvel:rank6.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank6: PSM3 can't open nic unit: 1 (err=23)
duvel:rank4.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank4: PSM3 can't open nic unit: 1 (err=23)
duvel:rank0.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank0: PSM3 can't open nic unit: 1 (err=23)
duvel:rank1: PSM3 can't open nic unit: 1 (err=23)
duvel:rank1.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank6.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank6: PSM3 can't open nic unit: 1 (err=23)
duvel:rank4.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank4: PSM3 can't open nic unit: 1 (err=23)
duvel:rank1.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank1: PSM3 can't open nic unit: 1 (err=23)
duvel:rank0.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank0: PSM3 can't open nic unit: 1 (err=23)
duvel:rank6.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank4.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank6: PSM3 can't open nic unit: 1 (err=23)
duvel:rank4: PSM3 can't open nic unit: 1 (err=23)
[duvel:3794239] mca: base: components_register: registering framework pml components
[duvel:3794239] mca: base: components_register: found loaded component v
[duvel:3794239] mca: base: components_register: component v register function successful
[duvel:3794239] mca: base: components_register: found loaded component monitoring
[duvel:3794239] mca: base: components_register: component monitoring register function successful
[duvel:3794239] mca: base: components_register: found loaded component cm
[duvel:3794239] mca: base: components_register: component cm register function successful
[duvel:3794239] mca: base: components_register: found loaded component ucx
[duvel:3794239] mca: base: components_register: component ucx register function successful
[duvel:3794239] mca: base: components_register: found loaded component ob1
[duvel:3794239] mca: base: components_register: component ob1 register function successful
[duvel:3794239] mca: base: components_open: opening pml components
[duvel:3794239] mca: base: components_open: found loaded component v
[duvel:3794239] mca: base: components_open: component v open function successful
[duvel:3794239] mca: base: components_open: found loaded component monitoring
[duvel:3794239] mca: base: components_open: component monitoring open function successful
[duvel:3794239] mca: base: components_open: found loaded component cm
[duvel:3794239] mca: base: components_open: component cm open function successful
[duvel:3794239] mca: base: components_open: found loaded component ucx
[duvel:3794241] mca: base: components_register: registering framework pml components
[duvel:3794241] mca: base: components_register: found loaded component v
[duvel:3794241] mca: base: components_register: component v register function successful
[duvel:3794241] mca: base: components_register: found loaded component monitoring
[duvel:3794241] mca: base: components_register: component monitoring register function successful
[duvel:3794241] mca: base: components_register: found loaded component cm
[duvel:3794241] mca: base: components_register: component cm register function successful
[duvel:3794237] mca: base: components_register: registering framework pml components
[duvel:3794237] mca: base: components_register: found loaded component v
[duvel:3794241] mca: base: components_register: found loaded component ucx
[duvel:3794237] mca: base: components_register: component v register function successful
[duvel:3794237] mca: base: components_register: found loaded component monitoring
[duvel:3794237] mca: base: components_register: component monitoring register function successful
[duvel:3794237] mca: base: components_register: found loaded component cm
[duvel:3794237] mca: base: components_register: component cm register function successful
[duvel:3794241] mca: base: components_register: component ucx register function successful
[duvel:3794241] mca: base: components_register: found loaded component ob1
[duvel:3794237] mca: base: components_register: found loaded component ucx
[duvel:3794241] mca: base: components_register: component ob1 register function successful
[duvel:3794241] mca: base: components_open: opening pml components
[duvel:3794241] mca: base: components_open: found loaded component v
[duvel:3794237] mca: base: components_register: component ucx register function successful
[duvel:3794237] mca: base: components_register: found loaded component ob1
[duvel:3794237] mca: base: components_register: component ob1 register function successful
[duvel:3794237] mca: base: components_open: opening pml components
[duvel:3794237] mca: base: components_open: found loaded component v
[duvel:3794241] mca: base: components_open: component v open function successful
[duvel:3794241] mca: base: components_open: found loaded component monitoring
[duvel:3794241] mca: base: components_open: component monitoring open function successful
[duvel:3794241] mca: base: components_open: found loaded component cm
[duvel:3794235] mca: base: components_register: registering framework pml components
[duvel:3794235] mca: base: components_register: found loaded component v
[duvel:3794235] mca: base: components_register: component v register function successful
[duvel:3794235] mca: base: components_register: found loaded component monitoring
[duvel:3794235] mca: base: components_register: component monitoring register function successful
[duvel:3794235] mca: base: components_register: found loaded component cm
[duvel:3794238] mca: base: components_register: registering framework pml components
[duvel:3794238] mca: base: components_register: found loaded component v
[duvel:3794238] mca: base: components_register: component v register function successful
[duvel:3794235] mca: base: components_register: component cm register function successful
[duvel:3794235] mca: base: components_register: found loaded component ucx
[duvel:3794238] mca: base: components_register: found loaded component monitoring
[duvel:3794238] mca: base: components_register: component monitoring register function successful
[duvel:3794237] mca: base: components_open: component v open function successful
[duvel:3794237] mca: base: components_open: found loaded component monitoring
[duvel:3794237] mca: base: components_open: component monitoring open function successful
[duvel:3794237] mca: base: components_open: found loaded component cm
[duvel:3794238] mca: base: components_register: found loaded component cm
[duvel:3794235] mca: base: components_register: component ucx register function successful
[duvel:3794238] mca: base: components_register: component cm register function successful
[duvel:3794238] mca: base: components_register: found loaded component ucx
[duvel:3794235] mca: base: components_register: found loaded component ob1
[duvel:3794235] mca: base: components_register: component ob1 register function successful
[duvel:3794241] mca: base: components_open: component cm open function successful
[duvel:3794241] mca: base: components_open: found loaded component ucx
[duvel:3794238] mca: base: components_register: component ucx register function successful
[duvel:3794235] mca: base: components_open: opening pml components
[duvel:3794235] mca: base: components_open: found loaded component v
[duvel:3794238] mca: base: components_register: found loaded component ob1
[duvel:3794238] mca: base: components_register: component ob1 register function successful
[duvel:3794238] mca: base: components_open: opening pml components
[duvel:3794238] mca: base: components_open: found loaded component v
[duvel:3794237] mca: base: components_open: component cm open function successful
[duvel:3794237] mca: base: components_open: found loaded component ucx
[duvel:3794235] mca: base: components_open: component v open function successful
[duvel:3794235] mca: base: components_open: found loaded component monitoring
[duvel:3794235] mca: base: components_open: component monitoring open function successful
[duvel:3794235] mca: base: components_open: found loaded component cm
[duvel:3794238] mca: base: components_open: component v open function successful
[duvel:3794238] mca: base: components_open: found loaded component monitoring
[duvel:3794238] mca: base: components_open: component monitoring open function successful
[duvel:3794238] mca: base: components_open: found loaded component cm
[duvel:3794234] mca: base: components_register: registering framework pml components
[duvel:3794234] mca: base: components_register: found loaded component v
[duvel:3794234] mca: base: components_register: component v register function successful
[duvel:3794234] mca: base: components_register: found loaded component monitoring
[duvel:3794235] mca: base: components_open: component cm open function successful
[duvel:3794235] mca: base: components_open: found loaded component ucx
[duvel:3794234] mca: base: components_register: component monitoring register function successful
[duvel:3794234] mca: base: components_register: found loaded component cm
[duvel:3794234] mca: base: components_register: component cm register function successful
[duvel:3794238] mca: base: components_open: component cm open function successful
[duvel:3794238] mca: base: components_open: found loaded component ucx
[duvel:3794234] mca: base: components_register: found loaded component ucx
[duvel:3794234] mca: base: components_register: component ucx register function successful
[duvel:3794234] mca: base: components_register: found loaded component ob1
[duvel:3794240] mca: base: components_register: registering framework pml components
[duvel:3794240] mca: base: components_register: found loaded component v
[duvel:3794234] mca: base: components_register: component ob1 register function successful
[duvel:3794240] mca: base: components_register: component v register function successful
[duvel:3794234] mca: base: components_open: opening pml components
[duvel:3794234] mca: base: components_open: found loaded component v
[duvel:3794240] mca: base: components_register: found loaded component monitoring
[duvel:3794240] mca: base: components_register: component monitoring register function successful
[duvel:3794240] mca: base: components_register: found loaded component cm
[duvel:3794240] mca: base: components_register: component cm register function successful
[duvel:3794240] mca: base: components_register: found loaded component ucx
[duvel:3794240] mca: base: components_register: component ucx register function successful
[duvel:3794240] mca: base: components_register: found loaded component ob1
[duvel:3794240] mca: base: components_register: component ob1 register function successful
[duvel:3794234] mca: base: components_open: component v open function successful
[duvel:3794234] mca: base: components_open: found loaded component monitoring
[duvel:3794234] mca: base: components_open: component monitoring open function successful
[duvel:3794234] mca: base: components_open: found loaded component cm
[duvel:3794240] mca: base: components_open: opening pml components
[duvel:3794240] mca: base: components_open: found loaded component v
[duvel:3794240] mca: base: components_open: component v open function successful
[duvel:3794240] mca: base: components_open: found loaded component monitoring
[duvel:3794240] mca: base: components_open: component monitoring open function successful
[duvel:3794240] mca: base: components_open: found loaded component cm
[duvel:3794234] mca: base: components_open: component cm open function successful
[duvel:3794234] mca: base: components_open: found loaded component ucx
[duvel:3794240] mca: base: components_open: component cm open function successful
[duvel:3794240] mca: base: components_open: found loaded component ucx
[duvel:3794236] mca: base: components_register: registering framework pml components
[duvel:3794236] mca: base: components_register: found loaded component v
[duvel:3794236] mca: base: components_register: component v register function successful
[duvel:3794236] mca: base: components_register: found loaded component monitoring
[duvel:3794236] mca: base: components_register: component monitoring register function successful
[duvel:3794236] mca: base: components_register: found loaded component cm
[duvel:3794236] mca: base: components_register: component cm register function successful
[duvel:3794236] mca: base: components_register: found loaded component ucx
[duvel:3794236] mca: base: components_register: component ucx register function successful
[duvel:3794236] mca: base: components_register: found loaded component ob1
[duvel:3794236] mca: base: components_register: component ob1 register function successful
[duvel:3794236] mca: base: components_open: opening pml components
[duvel:3794236] mca: base: components_open: found loaded component v
[duvel:3794236] mca: base: components_open: component v open function successful
[duvel:3794236] mca: base: components_open: found loaded component monitoring
[duvel:3794236] mca: base: components_open: component monitoring open function successful
[duvel:3794236] mca: base: components_open: found loaded component cm
[duvel:3794236] mca: base: components_open: component cm open function successful
[duvel:3794236] mca: base: components_open: found loaded component ucx
[duvel:3794239] mca: base: components_open: component ucx open function successful
[duvel:3794239] mca: base: components_open: found loaded component ob1
[duvel:3794239] mca: base: components_open: component ob1 open function successful
[duvel:3794238] mca: base: components_open: component ucx open function successful
[duvel:3794238] mca: base: components_open: found loaded component ob1
[duvel:3794238] mca: base: components_open: component ob1 open function successful
[duvel:3794234] mca: base: components_open: component ucx open function successful
[duvel:3794234] mca: base: components_open: found loaded component ob1
[duvel:3794234] mca: base: components_open: component ob1 open function successful
[duvel:3794235] mca: base: components_open: component ucx open function successful
[duvel:3794235] mca: base: components_open: found loaded component ob1
[duvel:3794235] mca: base: components_open: component ob1 open function successful
[duvel:3794240] mca: base: components_open: component ucx open function successful
[duvel:3794240] mca: base: components_open: found loaded component ob1
[duvel:3794240] mca: base: components_open: component ob1 open function successful
[duvel:3794241] mca: base: components_open: component ucx open function successful
[duvel:3794241] mca: base: components_open: found loaded component ob1
[duvel:3794241] mca: base: components_open: component ob1 open function successful
[duvel:3794237] mca: base: components_open: component ucx open function successful
[duvel:3794237] mca: base: components_open: found loaded component ob1
[duvel:3794237] mca: base: components_open: component ob1 open function successful
[duvel:3794239] select: component v not in the include list
[duvel:3794239] select: component monitoring not in the include list
[duvel:3794239] select: initializing pml component cm
[duvel:3794238] select: component v not in the include list
[duvel:3794238] select: component monitoring not in the include list
[duvel:3794238] select: initializing pml component cm
duvel:rank4.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank5.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank5: PSM3 can't open nic unit: 1 (err=23)
duvel:rank4: PSM3 can't open nic unit: 1 (err=23)
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: duvel
  Location: mtl_ofi_component.c:512
  Error: Invalid argument (22)
--------------------------------------------------------------------------
[duvel:3794235] select: component v not in the include list
[duvel:3794235] select: component monitoring not in the include list
[duvel:3794235] select: initializing pml component cm
[duvel:3794234] select: component v not in the include list
[duvel:3794234] select: component monitoring not in the include list
[duvel:3794234] select: initializing pml component cm
[duvel:3794240] select: component v not in the include list
[duvel:3794241] select: component v not in the include list
[duvel:3794241] select: component monitoring not in the include list
[duvel:3794241] select: initializing pml component cm
[duvel:3794240] select: component monitoring not in the include list
[duvel:3794240] select: initializing pml component cm
[duvel:3794237] select: component v not in the include list
[duvel:3794237] select: component monitoring not in the include list
[duvel:3794237] select: initializing pml component cm
[duvel:3794239] select: init returned failure for component cm
[duvel:3794239] select: initializing pml component ucx
[duvel:3794238] select: init returned failure for component cm
[duvel:3794238] select: initializing pml component ucx
[duvel:3794236] mca: base: components_open: component ucx open function successful
[duvel:3794236] mca: base: components_open: found loaded component ob1
[duvel:3794236] mca: base: components_open: component ob1 open function successful
duvel:rank1.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank0.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank1: PSM3 can't open nic unit: 1 (err=23)
duvel:rank0: PSM3 can't open nic unit: 1 (err=23)
duvel:rank6.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank3.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank7.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank3: PSM3 can't open nic unit: 1 (err=23)
duvel:rank6: PSM3 can't open nic unit: 1 (err=23)
duvel:rank7: PSM3 can't open nic unit: 1 (err=23)
[duvel:3794235] select: init returned failure for component cm
[duvel:3794234] select: init returned failure for component cm
[duvel:3794234] select: initializing pml component ucx
[duvel:3794235] select: initializing pml component ucx
[duvel:3794237] select: init returned failure for component cm
[duvel:3794237] select: initializing pml component ucx
[duvel:3794241] select: init returned failure for component cm
[duvel:3794241] select: initializing pml component ucx
[duvel:3794240] select: init returned failure for component cm
[duvel:3794240] select: initializing pml component ucx
[duvel:3794236] select: component v not in the include list
[duvel:3794236] select: component monitoring not in the include list
[duvel:3794236] select: initializing pml component cm
duvel:rank2.mpi_test_ring_c: Failed to get enx0c3796b36b50 (unit 1) cpu set
duvel:rank2: PSM3 can't open nic unit: 1 (err=23)
[duvel:3794236] select: init returned failure for component cm
[duvel:3794236] select: initializing pml component ucx
[duvel:3794239] select: init returned priority 19
[duvel:3794239] select: initializing pml component ob1
[duvel:3794239] select: init returned priority 20
[duvel:3794239] selected ob1 best priority 20
[duvel:3794239] select: component ob1 selected
[duvel:3794238] select: init returned priority 19
[duvel:3794238] select: initializing pml component ob1
[duvel:3794238] select: init returned priority 20
[duvel:3794238] selected ob1 best priority 20
[duvel:3794238] select: component ob1 selected
[duvel:3794239] select: component ucx not selected / finalized
[duvel:3794239] mca: base: close: component v closed
[duvel:3794239] mca: base: close: unloading component v
[duvel:3794239] mca: base: close: component monitoring closed
[duvel:3794239] mca: base: close: unloading component monitoring
[duvel:3794239] mca: base: close: component cm closed
[duvel:3794239] mca: base: close: unloading component cm
[duvel:3794239] mca: base: close: component ucx closed
[duvel:3794239] mca: base: close: unloading component ucx
[duvel:3794238] select: component ucx not selected / finalized
[duvel:3794238] mca: base: close: component v closed
[duvel:3794238] mca: base: close: unloading component v
[duvel:3794238] mca: base: close: component monitoring closed
[duvel:3794238] mca: base: close: unloading component monitoring
[duvel:3794238] mca: base: close: component cm closed
[duvel:3794238] mca: base: close: unloading component cm
[duvel:3794238] mca: base: close: component ucx closed
[duvel:3794238] mca: base: close: unloading component ucx
[duvel:3794234] select: init returned priority 19
[duvel:3794234] select: initializing pml component ob1
[duvel:3794234] select: init returned priority 20
[duvel:3794234] selected ob1 best priority 20
[duvel:3794234] select: component ob1 selected
[duvel:3794234] select: component ucx not selected / finalized
[duvel:3794234] mca: base: close: component v closed
[duvel:3794234] mca: base: close: unloading component v
[duvel:3794234] mca: base: close: component monitoring closed
[duvel:3794234] mca: base: close: unloading component monitoring
[duvel:3794234] mca: base: close: component cm closed
[duvel:3794234] mca: base: close: unloading component cm
[duvel:3794234] mca: base: close: component ucx closed
[duvel:3794234] mca: base: close: unloading component ucx
[duvel:3794241] select: init returned priority 19
[duvel:3794241] select: initializing pml component ob1
[duvel:3794241] select: init returned priority 20
[duvel:3794241] selected ob1 best priority 20
[duvel:3794241] select: component ob1 selected
[duvel:3794240] select: init returned priority 19
[duvel:3794240] select: initializing pml component ob1
[duvel:3794240] select: init returned priority 20
[duvel:3794240] selected ob1 best priority 20
[duvel:3794240] select: component ob1 selected
[duvel:3794237] select: init returned priority 19
[duvel:3794237] select: initializing pml component ob1
[duvel:3794237] select: init returned priority 20
[duvel:3794237] selected ob1 best priority 20
[duvel:3794237] select: component ob1 selected
[duvel:3794235] select: init returned priority 19
[duvel:3794235] select: initializing pml component ob1
[duvel:3794235] select: init returned priority 20
[duvel:3794235] selected ob1 best priority 20
[duvel:3794235] select: component ob1 selected
[duvel:3794236] select: init returned priority 19
[duvel:3794236] select: initializing pml component ob1
[duvel:3794236] select: init returned priority 20
[duvel:3794236] selected ob1 best priority 20
[duvel:3794236] select: component ob1 selected
[duvel:3794236] select: component ucx not selected / finalized
[duvel:3794236] mca: base: close: component v closed
[duvel:3794236] mca: base: close: unloading component v
[duvel:3794236] mca: base: close: component monitoring closed
[duvel:3794236] mca: base: close: unloading component monitoring
[duvel:3794236] mca: base: close: component cm closed
[duvel:3794236] mca: base: close: unloading component cm
[duvel:3794236] mca: base: close: component ucx closed
[duvel:3794236] mca: base: close: unloading component ucx
[duvel:3794241] select: component ucx not selected / finalized
[duvel:3794241] mca: base: close: component v closed
[duvel:3794241] mca: base: close: unloading component v
[duvel:3794241] mca: base: close: component monitoring closed
[duvel:3794241] mca: base: close: unloading component monitoring
[duvel:3794241] mca: base: close: component cm closed
[duvel:3794241] mca: base: close: unloading component cm
[duvel:3794240] select: component ucx not selected / finalized
[duvel:3794240] mca: base: close: component v closed
[duvel:3794240] mca: base: close: unloading component v
[duvel:3794240] mca: base: close: component monitoring closed
[duvel:3794240] mca: base: close: unloading component monitoring
[duvel:3794237] select: component ucx not selected / finalized
[duvel:3794237] mca: base: close: component v closed
[duvel:3794237] mca: base: close: unloading component v
[duvel:3794235] select: component ucx not selected / finalized
[duvel:3794235] mca: base: close: component v closed
[duvel:3794235] mca: base: close: unloading component v
[duvel:3794237] mca: base: close: component monitoring closed
[duvel:3794237] mca: base: close: unloading component monitoring
[duvel:3794235] mca: base: close: component monitoring closed
[duvel:3794235] mca: base: close: unloading component monitoring
[duvel:3794240] mca: base: close: component cm closed
[duvel:3794240] mca: base: close: unloading component cm
[duvel:3794241] mca: base: close: component ucx closed
[duvel:3794241] mca: base: close: unloading component ucx
[duvel:3794237] mca: base: close: component cm closed
[duvel:3794237] mca: base: close: unloading component cm
[duvel:3794235] mca: base: close: component cm closed
[duvel:3794235] mca: base: close: unloading component cm
[duvel:3794240] mca: base: close: component ucx closed
[duvel:3794240] mca: base: close: unloading component ucx
[duvel:3794237] mca: base: close: component ucx closed
[duvel:3794237] mca: base: close: unloading component ucx
[duvel:3794235] mca: base: close: component ucx closed
[duvel:3794235] mca: base: close: unloading component ucx
[duvel:3794240] check:select: checking my pml ob1 against process [[62116,1],0] pml ob1
[duvel:3794237] check:select: checking my pml ob1 against process [[62116,1],0] pml ob1
[duvel:3794239] check:select: checking my pml ob1 against process [[62116,1],0] pml ob1
[duvel:3794236] check:select: checking my pml ob1 against process [[62116,1],0] pml ob1
[duvel:3794234] check:select: PML check not necessary on self
[duvel:3794241] check:select: checking my pml ob1 against process [[62116,1],0] pml ob1
[duvel:3794238] check:select: checking my pml ob1 against process [[62116,1],0] pml ob1
[duvel:3794235] check:select: checking my pml ob1 against process [[62116,1],0] pml ob1
Process 0 sending 10 to 1, tag 201 (8 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
Process 5 exiting
Process 4 exiting
Process 6 exiting
Process 2 exiting
Process 3 exiting
Process 7 exiting
[duvel:3794235] mca: base: close: component ob1 closed
[duvel:3794235] mca: base: close: unloading component ob1
[duvel:3794240] mca: base: close: component ob1 closed
[duvel:3794240] mca: base: close: unloading component ob1
[duvel:3794236] mca: base: close: component ob1 closed
[duvel:3794236] mca: base: close: unloading component ob1
[duvel:3794239] mca: base: close: component ob1 closed
[duvel:3794239] mca: base: close: unloading component ob1
[duvel:3794241] mca: base: close: component ob1 closed
[duvel:3794241] mca: base: close: unloading component ob1
[duvel:3794237] mca: base: close: component ob1 closed
[duvel:3794237] mca: base: close: unloading component ob1
[duvel:3794234] mca: base: close: component ob1 closed
[duvel:3794234] mca: base: close: unloading component ob1
[duvel:3794238] mca: base: close: component ob1 closed
[duvel:3794238] mca: base: close: unloading component ob1
[duvel:3794230] 7 more processes have sent help message help-mtl-ofi.txt / OFI call fail
[duvel:3794230] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

@bosilca
Member

bosilca commented Apr 11, 2024

In both cases the CM PML is selected. However, on the wired network, all processes leave before completing the CM selection, with the error:

PSM3 can't open nic unit: 1 (err=23)

So, back to case 1, the OFI provider failed to correctly initialize the wired network.
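
For a long verbose log like the one above, a small script can tally which PML component each process tried, which ones failed to initialize, and which one was finally selected. The following is only a minimal sketch and not part of Open MPI: the function name `summarize_pml_selection` and the regular expressions are assumptions, written against the line formats visible in this particular log, and may need adjusting for other Open MPI versions.

```python
import re
from collections import defaultdict

# Assumed line formats, copied from the log in this issue:
#   [host:pid] select: init returned failure for component <name>
#   [host:pid] select: component <name> selected
FAIL_RE = re.compile(
    r"^\[[^:\]]+:(?P<pid>\d+)\] select: init returned failure for component (?P<comp>\w+)"
)
SEL_RE = re.compile(
    r"^\[[^:\]]+:(?P<pid>\d+)\] select: component (?P<comp>\w+) selected$"
)


def summarize_pml_selection(lines):
    """Return {pid: {"failed": [component, ...], "selected": component or None}}."""
    summary = defaultdict(lambda: {"failed": [], "selected": None})
    for raw in lines:
        line = raw.strip()
        m = FAIL_RE.match(line)
        if m:
            summary[m["pid"]]["failed"].append(m["comp"])
            continue
        m = SEL_RE.match(line)
        if m:
            summary[m["pid"]]["selected"] = m["comp"]
    return dict(summary)


# A few lines for one process, taken verbatim from the log above.
SAMPLE = """\
[duvel:3794239] select: init returned failure for component cm
[duvel:3794239] select: initializing pml component ucx
[duvel:3794239] select: init returned priority 19
[duvel:3794239] select: initializing pml component ob1
[duvel:3794239] select: init returned priority 20
[duvel:3794239] selected ob1 best priority 20
[duvel:3794239] select: component ob1 selected
[duvel:3794239] select: component ucx not selected / finalized
""".splitlines()

if __name__ == "__main__":
    for pid, info in sorted(summarize_pml_selection(SAMPLE).items()):
        print(pid, "failed:", info["failed"], "selected:", info["selected"])
```

Run against the full log, this kind of summary makes it easy to see that in this run each process reports `init returned failure for component cm` and then falls back to `ob1`.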

@hppritcha
Member

Although it seems kind of old, this may be related to ofiwg/libfabric#6710.
