[BUG] Segmentation fault when running analysis using multiple CPUs #411

Open
swutke opened this issue Nov 23, 2023 · 8 comments

Comments

@swutke

swutke commented Nov 23, 2023

Describe the bug
Running a DEC analysis as a Slurm job on an HPC cluster fails when using multiple CPUs (#SBATCH --ntasks=4). The run finishes when using only one CPU, but very slowly.

To Reproduce
The sbatch script looks like this:

#!/bin/bash -l
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH --mem-per-cpu=1000

#[load RevBayes module]

srun rb-mpi scripts/run_simple.Rev

Screenshots
Error message:

[r07c02:2173017:0:2173017] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid:2173017) ====
 0  /appl/opt/ucx/1.12.1/lib/libucs.so.0(ucs_handle_error+0x2a4) [0x7f8f8a559ff4]
 1  /appl/opt/ucx/1.12.1/lib/libucs.so.0(+0x2d1ec) [0x7f8f8a55a1ec]
 2  /appl/opt/ucx/1.12.1/lib/libucs.so.0(+0x2d498) [0x7f8f8a55a498]
 3  /appl/soft/bio/revbayes/gcc_11.3.0/revbayes-1.2.2/bin/rb-mpi() [0xeb3ebd]
 [....]
38  /appl/soft/bio/revbayes/gcc_11.3.0/revbayes-1.2.2/bin/rb-mpi() [0x66553c]
39  /lib64/libc.so.6(__libc_start_main+0xf3) [0x7f9011918cf3]
40  /appl/soft/bio/revbayes/gcc_11.3.0/revbayes-1.2.2/bin/rb-mpi() [0x69e12e]
=================================
[r07c02:2173017] *** Process received signal ***
[r07c02:2173017] Signal: Segmentation fault (11)
[r07c02:2173017] Signal code:  (-6)
[r07c02:2173017] Failing at address: 0x98a34500212859
[r07c02:2173017] [ 0] /lib64/libpthread.so.0(+0x12ce0)[0x7f9012251ce0]
[r07c02:2173017] [ 1] /appl/soft/bio/revbayes/gcc_11.3.0/revbayes-1.2.2/bin/rb-mpi[0xeb3ebd]
[...]
[r07c02:2173017] [29] /appl/soft/bio/revbayes/gcc_11.3.0/revbayes-1.2.2/bin/rb-mpi[0x6ab2dc]
[r07c02:2173017] *** End of error message ***
srun: error: r07c02: task 0: Segmentation fault
srun: launch/slurm: _step_signal: Terminating StepId=19455469.0
srun: error: r07c03: task 1: Terminated
srun: Force Terminated StepId=19455469.0

Computer info
Computing cluster; the crash happens with both RevBayes v1.2.0 and v1.2.2 (the RevBayes modules available on the HPC), running as a Slurm job.

Script and data files:
run_simple.zip

@bredelings
Contributor

How about if you use a smaller number of CPUs? Maybe 2 CPUs?

@bredelings
Contributor

I was able to reproduce this locally using the following command:

mpirun -c 2 rb-mpi-debug-O run_simple.Rev

I was able to get the two processes to run under gdb with this command:

mpirun -c 2 xterm -e gdbtui --args rb-mpi-debug-O run_simple.Rev

The segmentation fault occurs at the end of PhyloCTMCClado<NaturalNumberState>::computeInternalNodeLikelihood( ) when it tries to destroy the map eventMapProbs.
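
For context, a crash inside a destructor frequently means that memory was damaged earlier and is only noticed when the container is cleaned up. The following is a minimal illustrative sketch of that delay, not RevBayes code: `GuardedBuffer`, `kGuard`, and `intact()` are hypothetical names, and the guard values merely stand in for whatever bookkeeping a stray write might overwrite.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical guard value placed before and after the payload.
constexpr double kGuard = -12345.678;

// Hypothetical helper: a payload of n doubles surrounded by two guard slots.
// All writes in this sketch stay inside the vector, so behavior is defined.
class GuardedBuffer {
public:
    explicit GuardedBuffer(std::size_t n) : storage_(n + 2, 0.0), size_(n) {
        storage_.front() = kGuard;
        storage_.back() = kGuard;
    }

    // Unchecked pointer to the payload, mimicking raw pointer arithmetic.
    double* data() { return storage_.data() + 1; }

    std::size_t size() const { return size_; }

    // True while neither guard slot has been overwritten.
    bool intact() const {
        return storage_.front() == kGuard && storage_.back() == kGuard;
    }

private:
    std::vector<double> storage_;
    std::size_t size_;
};
```

Writing to `data()[size()]` succeeds silently because nothing checks the offset at the time of the write; the damage only becomes visible when `intact()` is consulted later, which is similar in spirit to a fault that surfaces at cleanup time.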

@bredelings
Contributor

I built a version with mpi + address-sanitizer using this command:

meson setup ../git gcc-13-mpi-sanitize-debug-O --prefix=/home/bredelings/Devel/revbayes/local/gcc-13-mpi-sanitize-debug-O --buildtype=debug -Db_sanitize=address -Doptimization=1 -Dmpi=true

The address sanitizer caught a memory access error here:

const double pl = *(p_site_mixture_left + c2);
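
Address sanitizer flags a read like this when the offset falls outside the array that the pointer refers to, which would point to an indexing bug rather than to the job running out of memory. As a rough sketch of the condition that is effectively being violated, assuming a hypothetical helper `checked_read` and a known block length (neither is part of RevBayes):

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: read base[offset] only if the offset lies inside
// a block of `length` elements; otherwise report the problem explicitly.
inline double checked_read(const double* base, std::size_t length, std::size_t offset) {
    if (base == nullptr) {
        throw std::invalid_argument("checked_read: null base pointer");
    }
    if (offset >= length) {
        throw std::out_of_range("checked_read: offset outside the block");
    }
    return *(base + offset);
}
```

With a two-element block, `checked_read(block.data(), 2, 1)` returns the second element, while an offset of 2 throws `std::out_of_range` instead of reading past the end.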

@bredelings
Contributor

It looks like [c1,c2,c3] = [1,1,1] here, and [1,1,1] is the first entry of eventMapProbs.

Also num_site_patterns is 1.

This doesn't look like a sampled ancestor node.
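
One way to narrow this down might be to check every index triple in the event map against the number of states before it is used as a pointer offset. The sketch below is only an assumption about the data layout: the `EventMap` alias, `count_invalid_events`, and `num_states` are hypothetical names chosen for illustration, and the real container in RevBayes may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Assumed shape of the event map: a triple of state indices
// (ancestor, left child, right child) mapped to a probability.
using EventMap = std::map<std::vector<unsigned>, double>;

// Count entries whose key is not a triple or contains an index
// that is not smaller than num_states.
inline std::size_t count_invalid_events(const EventMap& events, std::size_t num_states) {
    std::size_t invalid = 0;
    for (const auto& entry : events) {
        const std::vector<unsigned>& idx = entry.first;
        bool ok = (idx.size() == 3);
        for (std::size_t i = 0; ok && i < idx.size(); ++i) {
            ok = (idx[i] < num_states);
        }
        if (!ok) {
            ++invalid;
        }
    }
    return invalid;
}
```

Under these assumptions, a triple such as [1,1,1] is valid only when at least two states exist, so a nonzero count would indicate that the offsets and the allocated likelihood block disagree.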

@swutke
Author

swutke commented Nov 28, 2023

> How about if you use a smaller number of CPUs? Maybe 2 CPUs?

Hi,
yes, I tried that as well, but same error :(

@swutke
Author

swutke commented Nov 28, 2023

> I built a version with mpi + address-sanitizer using this command:
>
> meson setup ../git gcc-13-mpi-sanitize-debug-O --prefix=/home/bredelings/Devel/revbayes/local/gcc-13-mpi-sanitize-debug-O --buildtype=debug -Db_sanitize=address -Doptimization=1 -Dmpi=true
>
> The address sanitizer caught a memory access error here:
>
> const double pl = *(p_site_mixture_left + c2);

I am afraid I don't fully understand these steps. Does it mean I should increase the memory for the job?

@swutke
Author

swutke commented Nov 28, 2023

> It looks like [c1,c2,c3] = [1,1,1] here, and [1,1,1] is the first entry of eventMapProbs.
>
> Also num_site_patterns is 1.
>
> This doesn't look like a sampled ancestor node.

So, the input tree is the problem? It was estimated with BEAST2 using the FBD model (SA package).

@swutke
Author

swutke commented Dec 18, 2023

Hi, thank you so much for your comments, but I am still unsure what the actual problem is. I would be grateful if you could explain what I should do differently.


2 participants