
test/datatype: do not use installed components #12285

Open · wants to merge 1 commit into main

Conversation

@ggouaillardet (Contributor)

By default, the tests will use the components in the build directory, but also those in the installation directory, if any. That can cause some bizarre issues when the two are not compatible.

Thanks Kook Jin Noh for the initial bug report.

Refs. #12282

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
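
For readers who want to reproduce the isolation this PR is after, here is a minimal sketch of how a test run can be steered away from an installed tree, assuming the standard OMPI_MCA_* environment-variable overrides; the variable choices below are illustrative, not necessarily what the patch itself does:

```shell
# Sketch only: keep a make-check run from picking up an installed Open MPI.
# mca_base_component_path controls where DSO components are searched for;
# mca_base_param_files controls which openmpi-mca-params.conf files are read.
export OMPI_MCA_mca_base_component_path=/dev/null
export OMPI_MCA_mca_base_param_files=/dev/null
make check
```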
@jsquyres (Member) left a comment


I've honestly had this issue for years, and always made sure to just not have an installed OMPI available. I kinda like this solution better.

@bosilca (Member) commented Jan 29, 2024

It took me a while to understand what this change does to the tests. We should document this change better.

Btw, how are the tests using the components in the build directory? We load them using the install_dir; we don't even know the build path for each component. If we include everything in libmpi.so then this should not be a problem, but if we build shared-library components we default to first opening the components in the install path.

@ggouaillardet (Contributor, Author)

Well, shame on me!

The tests do not use the components from the build directory, and with this patch they do not open any component at all, which happens to be OK for test/datatype. They still try to open the openmpi-mca-params.conf from the install directory. export OMPI_MCA_mca_base_param_files= causes mpirun to crash (likely a bug), but using a bogus value does the trick.

I will double check that tomorrow and fix the commit message.
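
For concreteness, a sketch of the workaround mentioned above, assuming any non-existent path works as the "bogus value" (the file name and test binary below are arbitrary placeholders):

```shell
# An empty value reportedly makes mpirun crash, so point the variable at a
# file that does not exist instead of leaving it empty.
export OMPI_MCA_mca_base_param_files=/nonexistent/openmpi-mca-params.conf
mpirun -n 1 ./some_datatype_test    # placeholder test binary
```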

@jsquyres (Member)

the tests do not use the components from the build directory,

It's quite possible that we don't have any make check tests that use MCA components. I think most (all?) of our make check stuff was meant to be for the "core" library functionality -- at least back in the beginning of the project. That attitude may or may not be a good one to have all these years later, but given that we haven't invested much in our make check infrastructure, it's quite possible (likely) that it still reflects our attitudes from way back early in the project.

@ggouaillardet (Contributor, Author)

Thanks @rhc54 for fixing the param file issue in openpmix/openpmix#3273, I will backport the fix to Open MPI.

@bosilca (Member) commented Feb 1, 2024

I am not completely convinced by this. Preventing the use of components from an old installation (especially one that is not binary compatible internally) is a desirable feature, but this way of doing it is a little too strict.

On a default build (all components included in libmpi.so) this approach works by preventing OMPI from loading unnecessary/older components, while OMPI can still use all the static components that are part of the MPI shared library.

But in the case where we build with full support for shared libraries, i.e. all components in separate .so files, setting an empty MCA component path will prevent OMPI from loading any component, even the ones that are required (if, accelerator, to name just a few). As a result, many tests (or at least all the datatype tests) will segfault, as they require at least the NULL accelerator component.

Issue #12282 talks about OMPI 5.x loading components from an OMPI 4.1 installation. I don't understand how that can happen, because I was under the impression that we prevent the opening of components with an incorrect version.
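
To make the two build modes discussed above concrete, a rough sketch of the corresponding configure invocations (option spellings are from memory and should be double-checked against configure --help):

```shell
# Monolithic build: components are built into the libraries and dlopen is
# disabled, so an empty MCA component path changes nothing at runtime.
./configure --disable-dlopen --prefix=/opt/ompi

# Fully dynamic build: components are built as individual .so files and are
# found through the MCA component path, which is the case that breaks here.
./configure --enable-mca-dso --prefix=/opt/ompi
```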

@ggouaillardet (Contributor, Author)

I just found that the v4 and v5 series use the same MCA version (2.1.0)!
I do not know whether this is intended or something we forgot.

The issue is triggered by the mca_pmix_ext3x.so component.
It registers the library_version parameter, but mca_pmix_ext3x.so is dlclose()'d at the end, before the MCA parameters are destructed, and this is when the crash occurs.

My theory is that since PMIx is a first-class citizen from the v5 series on, no PMIx component is selected (nor expected), and hence the library_version parameter is not removed before dlclose() is invoked (also, I guess that should occur in the initialization phase instead of finalization).
A possible fix is to flag the pmix framework as "component-less" and not register any components (or not even open them in the first place). FWIW, I made an ugly proof of concept and it worked.

You are definitely correct about static vs dynamic components.

In the example you described, I guess make check would always fail unless make install had previously been invoked, so I would always build the NULL accelerator component statically. As a side effect, that would not invalidate the approach of this PR.
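
A possible way to pin the NULL accelerator component as static, assuming the usual framework-component syntax of --enable-mca-static (illustrative only):

```shell
# Build accelerator/null into the library so make check keeps working even
# with an empty MCA component path and without a prior make install.
./configure --enable-mca-static=accelerator-null --prefix=/opt/ompi
```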

@rhc54 (Contributor) commented Feb 1, 2024

The issue is triggered by the mca_pmix_ext3x.so component. It registers the library_version parameter, but mca_pmix_ext3x.so is dlclose()'d at the end before destructing the MCA parameters, and this is when the crash occurs.

Hmmm...I wonder if it would make sense to add a check in the ext3x component so it deregisters its params if dlopen of the PMIx library fails? In other words, check to see if it finds the PMIx lib and then deselect itself if not?

Won't help the OMPI v4 stuff that is out in the wild, but maybe help down the road?

FWIW: the MCA version didn't change because there was no modification to the MCA interface itself, so the plugins are in fact compatible.

@ggouaillardet (Contributor, Author)

I do not think the pmix/ext3x component (tries to) dlopen() the PMIx library.

From the point of view of Open MPI v5, the pmix/ext3x component is successfully loaded, but it never gets selected (because the pmix framework does not need components any more). And because no pmix component ever gets selected, it is never deselected until the very end.

@rhc54 (Contributor) commented Feb 1, 2024

Given that fixing OMPI v4 won't help with all the outstanding installations, maybe the right (but perhaps ugly) solution is to add something in the OMPI v5 MCA base that just says "if you find a PMIx component, ignore it". I believe that is the solution you prototyped, yes? Makes the best sense to me!

@bosilca (Member) commented Feb 1, 2024

One of the root causes of this is that we dlclose a component on which we successfully called _init, so we basically never give the component a chance to clean up behind itself. Calling _close would allow the deregistration of the MCA variables and avoid the segfault.

However, other issues persist:

  1. Why did OMPI 5.0 load a component for a framework that does not exist in this version? As @ggouaillardet indicated above, PMIx is now a base component; there is no reason to try to slurp in other PMIx-like components.
  2. The current protection for loading MCA modules is too weak (checking only the component ABI does not guarantee that these components will find everything they need from the library). At the very least the check should be symmetric, i.e. make sure the component was built for the right MPI library. Then the question becomes "how much do we limit the validity of our components?"

@ggouaillardet this approach will not solve the problem of not-yet-installed components. In fact the only solution to that would be to always build the accelerator_null component statically. We'll chat about this on our Slack channel and come back with a solution.
