Open MPI not running, fails with signal codes *** End of error message *** #12435

Open
Minyoung-sss opened this issue Mar 26, 2024 · 7 comments

Comments

@Minyoung-sss

Minyoung-sss commented Mar 26, 2024

Hello

I installed Open MPI version 4.1.2 and ran MAKER version 3.1.2, but it stops immediately with the error below.
(I ran it inside an anaconda3 environment named 'MAKER'.)

*** End of error message ***
SIGTERM received
SIGTERM thread
*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x5a4
[ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f26eb242520]
[ 1] /home/kucmb/anaconda3/envs/MAKER/bin/../lib/perl5/5.32/core_perl/CORE/libperl.so(Perl_csighandler3+0x38) [0x7f26eb6ff698]
[ 2] /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f26eb242520]
[ 3] /lib/x86_64-linux-gnu/libc.so.6(__poll+0x4f) [0x7f26eb318bcf]
[ 4] /lib/x86_64-linux-gnu/libevent_core-2.1.so.7(+0x24309) [0x7f26eb0ed309]
[ 5] /lib/x86_64-linux-gnu/libevent_core-2.1.so.7(event_base_loop+0x2a1) [0x7f26eb0e8921]
[ 6] /usr/local/lib/libopen-pal.so.40(+0x37e46) [0x7f26eb4bfe46]
[ 7] /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f26eb294ac3]
[ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f26eb326850]
*** End of error message ***

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
Perl exited with active threads:
             1 running and unjoined
             0 finished and unjoined
             0 running and detached

Screenshot from 2024-03-26 17-07-40

In addition, when I used the option '--mca btl ^openlib', this error came out:
Screenshot from 2024-03-26 17-00-48

What does this mean? I can't find out what kind of error this is or what causes it.

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v4.1.2

I had already run MAKER with mpirun using Open MPI v4.1.6, but it stopped immediately with the same error.
So I checked whether a different version of MPI was already installed on my computer.
I found the Ubuntu packages 'openmpi-bin' and 'openmpi-common', version 4.1.2.
I thought this was the cause, so I downgraded my Open MPI to version 4.1.2.

Is that right?
I don't know Ubuntu and MPI well because I started studying bioinformatics only one month ago.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

sudo ./configure --prefix=/usr/local --enable-mpirun-prefix-by-default
sudo make
sudo make install
![Screenshot from 2024-03-26 16-26-30](https://github.com/open-mpi/ompi/assets/153480806/9bc786d3-77a3-4428-8b94-b6722ad6d5c3)

https://chat.stackoverflow.com/rooms/153365/discussion-between-imworsethanyou-and-gilles-gouaillardet

vi ~/.bashrc
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so
source ~/.bashrc
export LD_LIBRARY_PATH=/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH

MAKER install
perl Build.PL
  /bin/mpicc (location of mpicc)
  /usr/local/include (location of mpi.h)
install

Details of the problem

shell$ mpirun -np 12 maker maker_opts.ctl maker_exe.ctl maker_bopts.ctl

I don't know why the same error appears and the MPI run stops.
Please help me.

Best Regards

Thank you for reading

@lrbison
Contributor

lrbison commented Mar 26, 2024

The error message suggests you add RDMAV_FORK_SAFE=1. Have you tried adding the following to your mpiexec line:

mpiexec -x RDMAV_FORK_SAFE=1 -np 12 ...

Additionally, your mpiexec output indicates you are using EFA via libfabric, but your config.log output indicates it will not be built with libfabric support. Are you sure you are running the mpi version you think you are?
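
One way to check, as a rough sketch: the snippet below lists every copy of a command found on PATH, in lookup order. The helper name list_on_path is made up for this example and is not part of Open MPI. If more than one mpirun or mpicc shows up (for example one from a source install and one from a distribution package), the first one listed is the one the shell runs.

```shell
# List every executable with the given name found on PATH, in lookup order.
# list_on_path is a hypothetical helper written for this example.
list_on_path() {
    name=$1
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        # An empty PATH entry means the current directory.
        [ -n "$dir" ] || dir=.
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
    IFS=$old_ifs
}

# Prints nothing for a command that is not installed.
list_on_path mpirun
list_on_path mpicc
```

Running `mpirun --version` and `ompi_info` from each listed directory should then show which build each copy belongs to.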

@Minyoung-sss
Author

Thank you for your kind answer.

I will try this command again:
mpiexec -x RDMAV_FORK_SAFE=1 -np 12 ...

But I have a question about the option '-x RDMAV_FORK_SAFE=1'.
Does it relate to the RDMA environment, and is that the cause of the signal 11 error?
I searched for this error on Google and found that it is related to computer memory and that the default is '0' (not enabled).

Also, I didn't know that my mpiexec output showed EFA via libfabric until you pointed it out. LOL
I didn't configure anything for EFA or libfabric, so it is correct that it was not built with libfabric support.

So I checked the running Open MPI version again, and I also checked the website of MAKER, which I want to run with Open MPI.

$ mpirun --version
mpirun (Open MPI) 4.1.2
$ which mpirun
/usr/local/bin/mpirun

The MAKER program can be used with any version of Open MPI or MPICH.

Do you think I should change my MPI version, or should I build with libfabric support?

If I need to reinstall a different MPI version, how can I completely remove the old MPI version?
Or,
if I should build with libfabric support, how can I do that?

Thank you for helping a rookie who still has a lot to learn.

Regards.

@Minyoung-sss
Author

In addition, the Open MPI version of my Ubuntu packages is 4.1.2.
Screenshot from 2024-03-27 10-32-09

If I need to reinstall a different MPI version, should I remove these packages as well?

I followed this website when I first installed Open MPI, so I think these packages are needed to install MPI.

@ggouaillardet
Contributor

Why don't you try the workaround first?

Note you have to use mpirun from the library that was used to build your application.
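
A minimal sketch of how that could be checked on Linux, assuming the application (or a compiled extension it loads) is a dynamically linked file: ldd shows which MPI shared libraries the file resolves to, and that directory should match the installation mpirun comes from. The helper name mpi_libs_of and the example path are hypothetical.

```shell
# Print the MPI-related shared libraries a binary or shared object resolves to.
# mpi_libs_of is a hypothetical helper written for this example.
mpi_libs_of() {
    ldd "$1" 2>/dev/null | grep -E 'libmpi|libopen-pal|libopen-rte' || true
}

# Location of the mpirun that the shell picks up, for comparison.
command -v mpirun || echo "mpirun not found on PATH"

# Example call (hypothetical path; substitute the compiled file to inspect):
# mpi_libs_of /path/to/compiled_extension.so
```

The result depends on the current LD_LIBRARY_PATH and LD_PRELOAD settings, so it is best run in the same shell environment as the job.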

@Minyoung-sss
Author

Minyoung-sss commented Mar 27, 2024

I tried this command:
mpiexec -x RDMAV_FORK_SAFE=1 -np 12 ...

However, the same error came out.

(MAKER) kucmb@kucmb-System-Product-Name:~/maker$ mpiexec -x RDMAV_FORK_SAFE=1 -np 12 maker maker_exe.ctl maker_opts.ctl maker_bopts.ctl
STATUS: Parsing control files...
STATUS: Processing and indexing input FASTA files...
[kucmb-System-Product-Name:401470] *** Process received signal ***
[kucmb-System-Product-Name:401470] Signal: Segmentation fault (11)
[kucmb-System-Product-Name:401470] Signal code: Address not mapped (1)
[kucmb-System-Product-Name:401470] Failing at address: 0x5a4
[kucmb-System-Product-Name:401470] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f1e33a42520]
[kucmb-System-Product-Name:401470] [ 1] /home/kucmb/anaconda3/envs/MAKER/bin/../lib/perl5/5.32/core_perl/CORE/libperl.so(Perl_csighandler3+0x38)[0x7f1e33eff698]
[kucmb-System-Product-Name:401470] [ 2] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f1e33a42520]
[kucmb-System-Product-Name:401470] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__poll+0x4f)[0x7f1e33b18bcf]
[kucmb-System-Product-Name:401470] [ 4] /lib/x86_64-linux-gnu/libevent_core-2.1.so.7(+0x24309)[0x7f1e339d3309]
[kucmb-System-Product-Name:401470] [ 5] /lib/x86_64-linux-gnu/libevent_core-2.1.so.7(event_base_loop+0x2a1)[0x7f1e339ce921]
[kucmb-System-Product-Name:401470] [ 6] /lib/x86_64-linux-gnu/libopen-pal.so.40(+0x2d646)[0x7f1e33d7a646]
[kucmb-System-Product-Name:401470] [ 7] /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7f1e33a94ac3]
[kucmb-System-Product-Name:401470] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x126850)[0x7f1e33b26850]
[kucmb-System-Product-Name:401470] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
SIGTERM received
SIGTERM received
SIGTERM received
SIGTERM received
SIGTERM received
SIGTERM received
SIGTERM received
SIGTERM received
SIGTERM received
SIGTERM received
SIGTERM received
--------------------------------------------------------------------------
mpiexec noticed that process rank 11 with PID 0 on node kucmb-System-Product-Name exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

But then I ran the previous command again by mistake:
mpiexec -np 12 maker...
and it ran.

I don't know why this command ran. I haven't changed anything.
I'll see whether it keeps running correctly.
I have a feeling that a new problem may come up during this run.

Thank you so much.

@Minyoung-sss
Author

Hello, everyone.

My computer ran this command fine for 3 days, but it suddenly stopped this morning.

#-------------------------------#
SIGTERM thread
SIGTERM received
deleted:130 hits
collecting blastn reports
SIGTERM thread
[kucmb-System-Product-Name:402259] *** Process received signal ***
[kucmb-System-Product-Name:402259] Signal: Segmentation fault (11)
[kucmb-System-Product-Name:402259] Signal code: Address not mapped (1)
[kucmb-System-Product-Name:402259] Failing at address: 0x5a4
[kucmb-System-Product-Name:402259] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f7159a42520]
[kucmb-System-Product-Name:402259] [ 1] /home/kucmb/anaconda3/envs/MAKER/bin/../lib/perl5/5.32/core_perl/CORE/libperl.so(Perl_csighandler3+0x38)[0x7f7159eff698]
[kucmb-System-Product-Name:402259] [ 2] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f7159a42520]
[kucmb-System-Product-Name:402259] [ 3] /home/kucmb/anaconda3/envs/MAKER/bin/../lib/perl5/5.32/core_perl/CORE/libperl.so(Perl_csighandler+0x0)[0x7f7159eff710]
[kucmb-System-Product-Name:402259] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f7159a42520]
[kucmb-System-Product-Name:402259] [ 5] /lib/x86_64-linux-gnu/libc.so.6(__poll+0x4f)[0x7f7159b18bcf]
[kucmb-System-Product-Name:402259] [ 6] /lib/x86_64-linux-gnu/libevent_core-2.1.so.7(+0x24309)[0x7f7159c76309]
[kucmb-System-Product-Name:402259] [ 7] /lib/x86_64-linux-gnu/libevent_core-2.1.so.7(event_base_loop+0x2a1)[0x7f7159c71921]
[kucmb-System-Product-Name:402259] [ 8] /lib/x86_64-linux-gnu/libopen-pal.so.40(+0x2d646)[0x7f715a1fc646]
[kucmb-System-Product-Name:402259] [ 9] /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7f7159a94ac3]
[kucmb-System-Product-Name:402259] [10] /lib/x86_64-linux-gnu/libc.so.6(+0x126850)[0x7f7159b26850]
[kucmb-System-Product-Name:402259] *** End of error message ***
running  blast search.

#-------------------------------#
deleted:90 hits
SIGTERM thread
SIGTERM received
--------------------------------------------------------------------------
mpiexec noticed that process rank 8 with PID 0 on node kucmb-System-Product-Name exited on signal 11 (Segmentation fault).

The same error again.
Is it a disk capacity problem?
This morning I got a notification that the disk capacity was insufficient.

Please give me any help you can.

Thank you

@lrbison
Contributor

lrbison commented Apr 29, 2024

There is not enough information here to help debug the problem. I suspect you are still mixing installation and runtime versions.

I suggest you do the following:

  • pick a new install location (to remove any doubt you are running the new version, perhaps $HOME/ompi_test)
  • remove any LD_PRELOAD options from bashrc, and old PATH and LD_LIBRARY_PATH options.
  • compile Open MPI from source with several options to ./configure:
    • --enable-debug (for better backtrace)
    • --prefix=$HOME/ompi_test (or whichever path you decided on)
    • --enable-mpirun-prefix-by-default (so runtime uses the same libraries)
  • Set PATH=$HOME/ompi_test/bin:$PATH and LD_LIBRARY_PATH=$HOME/ompi_test/lib:$LD_LIBRARY_PATH in .bashrc. Log out and log back in to apply.
  • rebuild your application if necessary.
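
The steps above might look roughly like this in the shell. This is only a sketch under the assumptions stated in the list ($HOME/ompi_test as the prefix, commands run from the unpacked Open MPI source directory). The build commands are left commented out because they take a while and depend on the local setup.

```shell
# Assumed install location, taken from the suggestion above.
PREFIX="$HOME/ompi_test"

# Build and install (run from the unpacked Open MPI source directory).
# No sudo is needed because the prefix is inside the home directory.
#   ./configure --prefix="$PREFIX" --enable-debug --enable-mpirun-prefix-by-default
#   make
#   make install

# Environment lines for ~/.bashrc, after removing any older Open MPI entries.
unset LD_PRELOAD
export PATH="$PREFIX/bin:$PATH"
export LD_LIBRARY_PATH="$PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# After installing, this should print a path under $PREFIX/bin.
command -v mpirun || echo "mpirun not found yet"
```

If `command -v mpirun` still prints a path outside the new prefix, an older PATH entry is probably taking precedence.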
