hydra_pmi_proxy encounters libcudart.so linker error on Ubuntu22 #6986

Open
BKitor opened this issue Apr 19, 2024 · 2 comments

@BKitor

BKitor commented Apr 19, 2024

I ran into an issue with the ssh launcher when MPICH is configured to use CUDA on Ubuntu22.
hydra_pmi_proxy runs into a linker error when it can't find libcudart.so:

user@frogfish:~$ mpirun -n 2 -hosts frogfish,kingfish -launcher ssh echo test
test
/usr/local/bin/hydra_pmi_proxy: error while loading shared libraries: libcudart.so.12: cannot open shared object file: No such file or directory
[mpiexec@frogfish] ui_cmd_cb (mpiexec/pmiserv_pmci.c:51): Launch proxy failed.
[mpiexec@frogfish] HYDT_dmxu_poll_wait_for_event (lib/tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@frogfish] HYD_pmci_wait_for_completion (mpiexec/pmiserv_pmci.c:173): error waiting for event
[mpiexec@frogfish] main (mpiexec/mpiexec.c:260): process manager error waiting for completion
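To see which libraries the proxy cannot resolve in the environment the ssh launcher actually provides, the `ldd` output from a non-interactive ssh session can be filtered for unresolved entries. This is a minimal sketch, not part of MPICH; the helper name `missing_libs` is hypothetical, and the host and binary path in the usage comment are taken from the output above.

```shell
# Print the names of shared libraries that ldd reports as "not found".
# Reads ldd output on stdin, so it works for a local or a remote ldd run.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Example usage (assumes passwordless ssh to the remote host):
#   ssh kingfish ldd /usr/local/bin/hydra_pmi_proxy | missing_libs
```

Running `ldd` through `ssh host command` matters here because it uses the same non-interactive shell setup as the launcher, so it sees the same LD_LIBRARY_PATH that hydra_pmi_proxy would.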

Workaround:

Since hydra_pmi_proxy itself fails to load, the intuitive places to set LD_LIBRARY_PATH don't work.
The default ~/.bashrc on Ubuntu22 returns early when bash runs in non-interactive mode, so anything exported after that check is never set, and the ssh launcher spawns hydra_pmi_proxy in non-interactive mode.
My workaround was to add the LD_LIBRARY_PATH export at the top of ~/.bashrc, before the interactive/non-interactive mode check.

Inside ~/.bashrc:

# make sure cuda libs are exported, even in non-interactive mode.
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH


# If not running interactively, don't do anything
case $- in
    *i*) ;;
      *) return;;
esac
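One detail of the export above: when LD_LIBRARY_PATH starts out empty, the result ends in a trailing colon, and the dynamic loader treats an empty entry as the current directory. A slightly more defensive variant is sketched below; the function name `prepend_ld_library_path` is hypothetical, and the CUDA path is the one from the snippet above.

```shell
# Prepend a directory to LD_LIBRARY_PATH without leaving a trailing colon
# and without adding the same directory twice.
prepend_ld_library_path() {
    dir=$1
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$dir:"*) ;;  # already present, nothing to do
        *) LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
    export LD_LIBRARY_PATH
}

# Must still sit above the interactive-mode check in ~/.bashrc.
prepend_ld_library_path /usr/local/cuda/lib64
```

The duplicate check keeps the variable from growing each time ~/.bashrc is sourced again in a nested shell.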

FWIW, MPICH+CUDA works out of the box on Rocky9, so this appears to be Ubuntu-specific.
I doubt this is a good long-term fix. It would probably be better if hydra_pmi_proxy had the libcudart.so directory set in its RPATH, or maybe some other form of linker magic.
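Whether a given hydra_pmi_proxy build already carries an RPATH or RUNPATH entry can be checked from its dynamic section. The sketch below assumes binutils `readelf` is installed; the helper name `embedded_runpath` is hypothetical and not part of MPICH.

```shell
# Print the RPATH/RUNPATH entries found in `readelf -d` output read from stdin.
# Prints nothing when the binary carries neither tag.
embedded_runpath() {
    sed -n -E 's/.*\((RPATH|RUNPATH)\).*\[(.*)\]/\2/p'
}

# Example usage (binary path taken from the error message above):
#   readelf -d /usr/local/bin/hydra_pmi_proxy | embedded_runpath
```

Empty output means the loader falls back to LD_LIBRARY_PATH and the system cache, which matches the failure seen here. Passing something like `LDFLAGS="-Wl,-rpath,/usr/local/cuda/lib64"` at configure time is one possible way to embed such an entry, though that has not been verified against the MPICH build system.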

@raffenet
Contributor

Thanks for the report! Hydra is getting the CUDA dependency from the embedded convenience library (MPL) that contains our GPU wrappers used in MPICH. Since none of that GPU code is actually used in Hydra, another solution might be to link Hydra with an MPL without any GPU dependencies. That or we move to dlopen/dlsym for GPU support in MPL, which shouldn't get triggered by the Hydra processes.

@raffenet
Contributor

raffenet commented Apr 22, 2024

@BKitor another workaround you can try is to build Hydra standalone without the CUDA dependency:

cd src/pm/hydra
./configure --prefix=<path/to/install> --without-cuda
make -j
make install
