[BUG] Segmentation fault when running analysis using multiple CPUs #411

swutke · 2023-11-23T09:30:27Z

Describe the bug
Running a DEC analysis as a Slurm job on an HPC cluster fails with a segmentation fault when using multiple CPUs (#SBATCH --ntasks=4). The run finishes when using only one CPU, but very slowly.

To Reproduce
The sbatch script looks like this:

#!/bin/bash -l
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH --mem-per-cpu=1000

#[load RevBayes module]

srun rb-mpi scripts/run_simple.Rev

Screenshots
Error message:

[r07c02:2173017:0:2173017] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace (tid:2173017) ====
 0  /appl/opt/ucx/1.12.1/lib/libucs.so.0(ucs_handle_error+0x2a4) [0x7f8f8a559ff4]
 1  /appl/opt/ucx/1.12.1/lib/libucs.so.0(+0x2d1ec) [0x7f8f8a55a1ec]
 2  /appl/opt/ucx/1.12.1/lib/libucs.so.0(+0x2d498) [0x7f8f8a55a498]
 3  /appl/soft/bio/revbayes/gcc_11.3.0/revbayes-1.2.2/bin/rb-mpi() [0xeb3ebd]
 [....]
38  /appl/soft/bio/revbayes/gcc_11.3.0/revbayes-1.2.2/bin/rb-mpi() [0x66553c]
39  /lib64/libc.so.6(__libc_start_main+0xf3) [0x7f9011918cf3]
40  /appl/soft/bio/revbayes/gcc_11.3.0/revbayes-1.2.2/bin/rb-mpi() [0x69e12e]
=================================
[r07c02:2173017] *** Process received signal ***
[r07c02:2173017] Signal: Segmentation fault (11)
[r07c02:2173017] Signal code:  (-6)
[r07c02:2173017] Failing at address: 0x98a34500212859
[r07c02:2173017] [ 0] /lib64/libpthread.so.0(+0x12ce0)[0x7f9012251ce0]
[r07c02:2173017] [ 1] /appl/soft/bio/revbayes/gcc_11.3.0/revbayes-1.2.2/bin/rb-mpi[0xeb3ebd]
[...]
[r07c02:2173017] [29] /appl/soft/bio/revbayes/gcc_11.3.0/revbayes-1.2.2/bin/rb-mpi[0x6ab2dc]
[r07c02:2173017] *** End of error message ***
srun: error: r07c02: task 0: Segmentation fault
srun: launch/slurm: _step_signal: Terminating StepId=19455469.0
srun: error: r07c03: task 1: Terminated
srun: Force Terminated StepId=19455469.0

Computer info
Computing cluster, running as a Slurm job. The crash happens with both RevBayes v1.2.0 and v1.2.2 (these are the RevBayes modules available on the HPC).

Script and data files:
run_simple.zip


bredelings · 2023-11-24T14:57:53Z

How about if you use a smaller number of CPUs? Maybe 2 CPUs?

bredelings · 2023-11-24T15:29:42Z

I was able to reproduce this locally using the following command:

mpirun -c 2 rb-mpi-debug-O run_simple.Rev

I was able to get the two processes to run under gdb with this command:

mpirun -c 2 xterm -e gdbtui --args rb-mpi-debug-O run_simple.Rev

The segmentation fault occurs at the end of PhyloCTMCClado<NaturalNumberState>::computeInternalNodeLikelihood( ) when it tries to destroy the map eventMapProbs.

bredelings · 2023-11-24T15:54:45Z

I built a version with mpi + address-sanitizer using this command:

meson setup ../git gcc-13-mpi-sanitize-debug-O --prefix=/home/bredelings/Devel/revbayes/local/gcc-13-mpi-sanitize-debug-O --buildtype=debug -Db_sanitize=address -Doptimization=1 -Dmpi=true

The address sanitizer caught a memory access error here:

revbayes/src/core/distributions/phylogenetics/substitution/PhyloCTMCClado.h

Line 422 in a4a40ae

const double pl = *(p_site_mixture_left + c2);

bredelings · 2023-11-24T16:37:26Z

It looks like [c1,c2,c3] = [1,1,1] here, and [1,1,1] is the first entry of eventMapProbs.

Also num_site_patterns is 1.

This doesn't look like a sampled ancestor node.
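
The read flagged by the address sanitizer is a raw pointer offset, indexed by a state taken from an eventMapProbs key. As a minimal sketch of how such an index could be validated before the read, something like the following would turn a silent out-of-bounds access into an explicit error. This is not the RevBayes implementation: `EventMap`, `checked_read`, and `sum_over_events` are hypothetical names, and the real likelihood buffers are laid out differently.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a cladogenetic event map: each key is a
// triple {c1, c2, c3} of state indices, each value an event probability.
using EventMap = std::map<std::vector<unsigned>, double>;

// Bounds-checked alternative to a raw read such as *(ptr + c).
// Throws instead of reading past the end of the buffer.
double checked_read(const std::vector<double>& partials, std::size_t c)
{
    if (c >= partials.size())
        throw std::out_of_range("state index outside likelihood buffer");
    return partials[c];
}

// Sums prob * left[c2] * right[c3] over all events (illustrative only).
double sum_over_events(const EventMap& events,
                       const std::vector<double>& left,
                       const std::vector<double>& right)
{
    double total = 0.0;
    for (const auto& entry : events)
    {
        const std::vector<unsigned>& states = entry.first;
        assert(states.size() == 3); // assumed key layout: {c1, c2, c3}
        const double pl = checked_read(left, states[1]);
        const double pr = checked_read(right, states[2]);
        total += entry.second * pl * pr;
    }
    return total;
}
```

With a key of [1,1,1], a buffer holding only one value per child would make `checked_read` throw, whereas the unchecked pointer arithmetic would read whatever lies next in memory. Whether that is what happens at line 422 is not established in this thread.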

swutke · 2023-11-28T15:54:14Z

> How about if you use a smaller number of CPUs? Maybe 2 CPUs?

Hi,
yes, I tried that as well, but I get the same error :(

swutke · 2023-11-28T15:55:45Z

> I built a version with mpi + address-sanitizer using this command:
> meson setup ../git gcc-13-mpi-sanitize-debug-O --prefix=/home/bredelings/Devel/revbayes/local/gcc-13-mpi-sanitize-debug-O --buildtype=debug -Db_sanitize=address -Doptimization=1 -Dmpi=true
> The address sanitizer caught a memory access error here:
>
> revbayes/src/core/distributions/phylogenetics/substitution/PhyloCTMCClado.h
>
> Line 422 in a4a40ae
>
> const double pl = *(p_site_mixture_left + c2);

I am afraid I don't fully understand these steps. Does it mean I should increase the memory for the job?

swutke · 2023-11-28T15:57:55Z

> It looks like [c1,c2,c3] = [1,1,1] here, and [1,1,1] is the first entry of eventMapProbs.
>
> Also num_site_patterns is 1.
>
> This doesn't look like a sampled ancestor node.

So, the input tree is the problem? It was estimated with BEAST2 using the FBD model (SA package).

swutke · 2023-12-18T12:20:29Z

Hi, thank you so much for your comments, but I am still unsure what the actual problem is. I would be grateful if you could explain what I should do differently.
