Issue detecting a profile on single node system #28

Open
thomas-robinson opened this issue Aug 10, 2021 · 5 comments

@thomas-robinson

I am trying to use e4s-cl profile detect to set up my profile. Here is my sample Fortran program, which sums the ranks (when run with 11 ranks, isum = 55):

PROGRAM hello_world_mpi
include 'mpif.h'

integer process_Rank, size_Of_Cluster, ierror
integer root_rank, isum

call MPI_INIT(ierror)
call MPI_COMM_SIZE(MPI_COMM_WORLD, size_Of_Cluster, ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD, process_Rank, ierror)


root_rank = 0
call MPI_Reduce(process_Rank, isum, 1, MPI_INTEGER, MPI_SUM, root_rank, MPI_COMM_WORLD, ierror)
call MPI_bcast (isum, 1, MPI_INTEGER, root_rank, MPI_COMM_WORLD, ierror)

print *, 'Hello World from process: ', process_Rank, 'of ', size_Of_Cluster, 'sum = ', isum
end program
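
As a quick sanity check on the expected value: the reduction sums the ranks 0 through N-1, so the result should be N(N-1)/2. A minimal Python sketch of that arithmetic (the function name is only illustrative):

```python
def expected_rank_sum(n_ranks: int) -> int:
    """Sum of the MPI ranks 0..n_ranks-1, i.e. what MPI_SUM over the ranks yields."""
    return n_ranks * (n_ranks - 1) // 2


print(expected_rank_sum(11))  # 55, the value quoted for 11 ranks
print(expected_rank_sum(2))   # 1
```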

I compiled it with mpiifort:

$ mpiifort -v
mpiifort for the Intel(R) MPI Library 2019 Update 9 for Linux*
Copyright 2003-2020, Intel Corporation.
ifort version 19.1.3.304

The program runs with mpirun:

$ mpirun --version
Intel(R) MPI Library for Linux* OS, Version 2019 Update 9 Build 20200923 (id: abd58e492)
Copyright 2003-2020, Intel Corporation.
$ mpirun -np 11  -hosts lscamd50-d.gfdl.noaa.gov ./test.x

Here is my e4s-cl command:

$ e4s-cl profile detect -p am4Run mpirun -np 11  -hosts lscamd50-d.gfdl.noaa.gov ./test.x
Failed to determine necessary libraries.

The advice in the documentation is to specify multiple hosts (https://e4s-project.github.io/e4s-cl/reference/profiles/detect.html#profile-detect), but this is a single-node system with 128 cores. How can I get all of the libraries needed to run on my system?

@sameershende
Collaborator

sameershende commented Aug 10, 2021 via email

@spoutn1k
Collaborator

spoutn1k commented Aug 10, 2021 via email

@spoutn1k spoutn1k self-assigned this Aug 10, 2021
@spoutn1k
Collaborator

spoutn1k commented Aug 10, 2021

On further testing, it seems like the issue does not originate in e4s-cl. The library detection is done by leveraging the ptrace capabilities and recording all open and openat syscalls.
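
To illustrate the idea (this is not e4s-cl's actual implementation, which traces the process directly), here is a rough Python sketch that takes lines in the format printed by strace -e open,openat, keeps the calls that succeeded, and separates shared libraries from other files. The regular expressions, the function names, and the libmissing.so.1 path are assumptions made for the example.

```python
import re

# Successful open/openat call as printed by strace: the path is the first quoted
# argument and the return value is a non-negative file descriptor.
OPEN_CALL = re.compile(r'open(?:at)?\((?:[^",]+,\s*)?"([^"]+)"[^)]*\)\s*=\s*(\d+)')

# Shared object names such as libc.so.6 or libmpi.so.12
SHARED_OBJECT = re.compile(r'\.so(\.\d+)*$')


def opened_paths(strace_lines):
    """Return the paths of all open/openat calls that succeeded."""
    paths = []
    for line in strace_lines:
        match = OPEN_CALL.search(line)
        if match:  # failed calls return -1 and do not match
            paths.append(match.group(1))
    return paths


def split_libraries(paths):
    """Split paths into (libraries, other files), both as sets."""
    libraries = {p for p in paths if SHARED_OBJECT.search(p)}
    return libraries, set(paths) - libraries


sample = [
    'openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3',
    'openat(AT_FDCWD, "/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3',
    # hypothetical failed lookup, ignored by the parser
    'openat(AT_FDCWD, "/lib64/libmissing.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)',
    'open("/etc/libnl/classid", O_RDONLY) = 5',
]
libraries, files = split_libraries(opened_paths(sample))
print(sorted(libraries))  # ['/lib64/libc.so.6']
print(sorted(files))      # ['/etc/ld.so.cache', '/etc/libnl/classid']
```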

This seems to fail with the binaries created by mpiifort. The exact same error happens when using strace:

$ mpirun -np 2 strace -e open,openat ./issue-28
[...]
ofi_mlx_hcoll.dat", O_RDONLY) = 67
) = 67
 Hello World from process:            0 of            2 sum =            1
 Hello World from process:            1 of            2 sum =            1
+++ exited with 0 +++

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 1345869 RUNNING AT illyad
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

I will look into this. I am positive profile detection worked with Fortran binaries compiled with other MPI flavours, so this must be an Intel quirk.

A profile created with a C program can usually also be used after adding the Fortran libraries (libmpifort.so) from the install directory.

e4s-cl init --profile am4Run
e4s-cl profile edit --add-libraries </path/to/libmpifort.so> ...

@thomas-robinson
Author

Sorry for the delay and the long post. I couldn't post earlier for some reason.
Yes, I ran e4s-cl init; I forgot to mention that.

$ e4s-cl init
The target launcher /opt/intel/2020_up3/compilers_and_libraries/linux/mpi/intel64/bin/mpirun uses a single host by default, which may tamper with the library discovery. Consider running `e4s-cl profile detect` using mpirun specifying multiple hosts.
$ e4s-cl profile detect -p am4Run mpirun -np 11 -hosts lscamd50-d.gfdl.noaa.gov ./test.x
Failed to determine necessary libraries.

I ran it again with verbose (debug) output and got this:

$ e4s-cl -v profile detect -p am4Run mpirun -np 11 -hosts lscamd50-d.gfdl.noaa.gov ./test.x 
[Debug] Arguments: Namespace(command='profile', options=['detect', '-p', 'am4Run', 'mpirun', '-np', '11', '-hosts', 'lscamd50-d.gfdl.noaa.gov', './test.x'], dry_run=None, slave=None, verbose='DEBUG')
[Debug] Verbosity level: DEBUG
[Debug] e4s-cl profile args: Namespace(subcommand='detect', options=['-p', 'am4Run', 'mpirun', '-np', '11', '-hosts', 'lscamd50-d.gfdl.noaa.gov', './test.x'])
[Debug] e4s-cl profile detect args: Namespace(profile_name='am4Run', cmd=['mpirun', '-np', '11', '-hosts', 'lscamd50-d.gfdl.noaa.gov', './test.x'])
[Debug] Creating subprocess: mpirun -np 11 -hosts lscamd50-d.gfdl.noaa.gov /home/Thomas.Robinson/e4s-cl/bin/e4s-cl --slave profile detect ./test.x
[Debug] Hello World from process:            5 of           11 sum =           55
 Hello World from process:            3 of           11 sum =           55
 Hello World from process:            0 of           11 sum =           55
 Hello World from process:            1 of           11 sum =           55
 Hello World from process:            4 of           11 sum =           55
 Hello World from process:            6 of           11 sum =           55
 Hello World from process:            7 of           11 sum =           55
 Hello World from process:            8 of           11 sum =           55
 Hello World from process:            2 of           11 sum =           55
 Hello World from process:           10 of           11 sum =           55
 Hello World from process:            9 of           11 sum =           55
{"files": {"__type": "set", "__list": ["/opt/intel/2020_up3/compilers_and_libraries/linux/mpi/intel64/lib/release/libmpi.so.12", "/etc/libnl/classid", "/opt/intel/2020_up3/compilers_and_libraries/linux/mpi/intel64/etc/tuning_generic_shm-ofi.dat"]}, "libraries": {"__type": "set", "__list": ["/lib64/libpsm2.so.2", "/lib64/libnl-route-3.so.200", "/lib64/libc.so.6", "/lib64/libnl-3.so.200", "/lib64/libfabric.so.1", "/lib64/libgcc_s.so.1", "/lib64/libnuma.so.1", "/lib64/libm.so.6", "/lib64/librt.so.1", "/opt/intel/2020_up3/compilers_and_libraries/linux/mpi/intel64/lib/libmpifort.so.12", "/lib64/libdl.so.2", "/lib64/libpthread.so.0", "/lib64/libibverbs.so.1", "/lib64/librdmacm.so.1", "/lib64/libefa.so.1"]}}

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 2127227 RUNNING AT lscamd50-d.gfdl.noaa.gov
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 2 PID 2127229 RUNNING AT lscamd50-d.gfdl.noaa.gov
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 3 PID 2127230 RUNNING AT lscamd50-d.gfdl.noaa.gov
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 4 PID 2127231 RUNNING AT lscamd50-d.gfdl.noaa.gov
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 5 PID 2127232 RUNNING AT lscamd50-d.gfdl.noaa.gov
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 6 PID 2127233 RUNNING AT lscamd50-d.gfdl.noaa.gov
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 7 PID 2127234 RUNNING AT lscamd50-d.gfdl.noaa.gov
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 8 PID 2127235 RUNNING AT lscamd50-d.gfdl.noaa.gov
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 9 PID 2127236 RUNNING AT lscamd50-d.gfdl.noaa.gov
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 10 PID 2127237 RUNNING AT lscamd50-d.gfdl.noaa.gov
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================
[Debug] ['mpirun', '-np', '11', '-hosts', 'lscamd50-d.gfdl.noaa.gov', '/home/Thomas.Robinson/e4s-cl/bin/e4s-cl', '--slave', 'profile', 'detect', './test.x'] returned 255
Failed to determine necessary libraries.
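
For reference, the JSON line in the log above writes each set as an object with "__type": "set" and a "__list" of members. A small Python sketch of how such a line could be decoded back into sets follows; the decode_sets hook is my own assumption about how to read that format, not code taken from e4s-cl, and the sample is shortened from the log.

```python
import json


def decode_sets(obj):
    """json object_hook: turn {"__type": "set", "__list": [...]} back into a set."""
    if obj.get("__type") == "set" and "__list" in obj:
        return set(obj["__list"])
    return obj


# Shortened copy of the line printed by the detect subprocess
line = (
    '{"files": {"__type": "set", "__list": ["/etc/libnl/classid"]}, '
    '"libraries": {"__type": "set", "__list": ["/lib64/libc.so.6", '
    '"/opt/intel/2020_up3/compilers_and_libraries/linux/mpi/intel64/lib/libmpifort.so.12"]}}'
)

detected = json.loads(line, object_hook=decode_sets)
for path in sorted(detected["libraries"]):
    print(path)
```

Decoded this way, the log shows that the libraries were in fact found before the ranks were killed.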

If I change to a C program, do you think it will work then? What libraries do I need to link in?

I tried to launch with the default profile that was created, but I got a different error:

$ e4s-cl launch --backend singularity --image am4_2021.03_ubuntu_intel.sif mpirun -n 48 ./2021.03_run.sh
Using selected profile default-137215bba819ae9d045d5b51c339b35e38c270bdafcf5d6a9181ae2e3640502d
2137479 on lscamd50-d.gfdl.noaa.gov: ./2021.03_run.sh: error while loading shared libraries: ./2021.03_run.sh: invalid ELF header

Maybe this is a different issue.

@spoutn1k
Collaborator

GitHub's servers were down for a little while; I wasn't able to edit either!

I tested profile detection with C programs, and it should work. Detection is only needed to build the list of libraries: once the profile exists, you can run Fortran binaries with the tool, because ptrace is not invoked during execution.

Unfortunately, the invalid ELF header error is a separate issue. e4s-cl tries to override the container's dynamic linker, but because the final command is a shell script, the linker, which expects a binary, fails. Depending on the contents of 2021.03_run.sh, you can move all the setup steps into a separate shell script and pass only the final binary invocation on the command line. e4s-cl can source scripts before execution in the container.
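
If it is unclear whether a launch target is a binary or a script, the first bytes of the file tell them apart: ELF files begin with the four bytes 0x7f 'E' 'L' 'F', while scripts normally begin with "#!". Here is a minimal Python sketch of such a check; the helper names are hypothetical and this is not something e4s-cl is known to run itself.

```python
import os
import tempfile

ELF_MAGIC = b"\x7fELF"  # first four bytes of every ELF file


def is_elf(path):
    """Return True if the file at `path` starts with the ELF magic number."""
    with open(path, "rb") as handle:
        return handle.read(4) == ELF_MAGIC


def has_shebang(path):
    """Return True if the file starts with '#!', i.e. it names an interpreter."""
    with open(path, "rb") as handle:
        return handle.read(2) == b"#!"


def write_temp(content):
    """Write bytes to a temporary file and return its path (demo helper)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as handle:
        handle.write(content)
    return path


script = write_temp(b"#!/bin/sh\necho hello\n")
fake_binary = write_temp(b"\x7fELF" + b"\x00" * 12)  # header stub, not a real program
print(is_elf(script), has_shebang(script))            # False True
print(is_elf(fake_binary), has_shebang(fake_binary))  # True False
os.remove(script)
os.remove(fake_binary)
```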

Here I add all but the last line to a setup script, and pass it to e4s-cl by editing the profile:

head -n -1 2021.03_run.sh > setup.sh
e4s-cl profile edit --source $PWD/setup.sh --backend singularity --image am4_2021.03_ubuntu_intel.sif
e4s-cl launch mpirun -n 48 `tail -n 1 2021.03_run.sh`
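
The head/tail split above amounts to "everything but the last line is setup, the last line is the command". The same logic as a Python sketch, in case it helps to check what would end up where; the script contents below are made up, since I have not seen 2021.03_run.sh. Like the shell version, it assumes the binary call really is the last line of the file, with no trailing blank line.

```python
def split_run_script(text):
    """Split a run script into (setup, command), like head -n -1 and tail -n 1."""
    lines = text.splitlines()
    if not lines:
        raise ValueError("the script is empty")
    setup_lines = lines[:-1]
    setup = "\n".join(setup_lines) + "\n" if setup_lines else ""
    return setup, lines[-1]


# Hypothetical script contents, for illustration only
example = "#!/bin/sh\nexport OMP_NUM_THREADS=1\nulimit -s unlimited\n./model.x\n"

setup, command = split_run_script(example)
print(setup, end="")  # the three setup lines, to be written to setup.sh
print(command)        # ./model.x
```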
