Ascent appears to cause a segfault #3873

robertsawko · 2024-04-05T15:52:13Z

Hello,

Together with two colleagues, I have been using AMReX's built-in Ascent integration. We followed the three Blueprint tutorials closely and implemented a function that uses our finest mesh to produce a mesh blueprint with SingleLevelToBlueprint and passes it to Ascent with some actions to execute.

The simulations run fine and generate images as expected, but end with a segfault:

AMReX (v2024-22-ga611acdeafed) finalized
[sqg1cintr16:55480:0:55480] Caught signal 11 (Segmentation fault: <unknown si_code> at address 0x7ffe7df9bcb0)
==== backtrace ====
 0 0x0000000000036280 killpg()  ???:0
===================
Segmentation fault (core dumped)

This is strange to me in several ways. First, it looks like AMReX actually finalizes fine, and the segfault only appears afterwards. I am used to segfaults being quite fatal to running programs.

The problem happens both in parallel and in serial. I cannot reproduce it with the heat equation tutorial, but one of my colleagues reports that he saw something like this with the 2D heat equation too.

The code itself is not even very interesting:

// Allocate a multifab in amrex::Vector<amrex::MultiFab>
// call SingleLevelToBlueprint with the final level and store result in bp_mesh

ascent::Ascent ascent;
conduit::Node open_opts;
open_opts["mpi_comm"] = MPI_Comm_c2f(amrex::ParallelDescriptor::Communicator());
ascent.open(open_opts);

ascent.publish(bp_mesh);

// set up some actions and scenes
ascent.execute(actions);

ascent.close();

I've run a backtrace on the core dump, but I am still none the wiser:

(gdb) bt
#0  0x0000000001972bd0 in ?? ()
#1  0x00002b9f1020d27e in std::shared_ptr<vtkm::cont::RuntimeDeviceTracker>::~shared_ptr() ()
   from /lustre/scafellpike/local/HT05466/xxg04/shared/extern/ascent/build/install/vtk-m-v2.1.0/lib/libvtkm_cont-2.1.so.2.1
#2  0x00002b9f055caec6 in (anonymous namespace)::run (p=<optimized out>)
    at ../../../../libstdc++-v3/libsupc++/atexit_thread.cc:75
#3  0x00002b9f05e55b69 in __run_exit_handlers () from /usr/lib64/libc.so.6
#4  0x00002b9f05e55bb7 in exit () from /usr/lib64/libc.so.6
#5  0x00002b9f05e3e3dc in __libc_start_main () from /usr/lib64/libc.so.6
#6  0x000000000041e826 in _start ()

Could you please give us any suggestions as to what might be going wrong?

BenWibking · 2024-04-05T16:09:55Z

Is this running on GPU?

Some Ascent actions (data binning, in particular) assume that GPU memory is accessible from the CPU (as is the case on Summit, Frontier, and other unified memory systems).

See: Alpine-DAV/ascent#1122.

WeiqunZhang · 2024-04-05T20:31:06Z

@cyrush Cyrus, do you have any suggestions?

robertsawko · 2024-04-06T08:53:26Z

Thanks for the quick replies.

> Is this running on GPU?

No, this is an oldish all-CPU system.

The really surprising thing is that everything else seems to be working just fine. I am still trying to reproduce the error on the AMReX tutorial. My colleague got this screenshot on the heat equation tutorial:

BenWibking · 2024-04-06T20:11:37Z

I think I misunderstood your original issue.

Your issue actually seems very similar to a different issue I reported here: #2994. I never fully understood why that happened, but it was somehow (?) caused by an unrelated global variable.

WeiqunZhang · 2024-04-06T21:25:52Z

I cannot reproduce the issue with the heat equation test. This is what I did.

$ spack install ascent
$ cd amrex-tutorials/ExampleCodes/Blueprint/HeatEquation_EX1_C/Exec
$ make -j DEBUG=FALSE USE_CONDUIT=TRUE USE_ASCENT=TRUE CONDUIT_DIR=/path/to/spack-installed-conduit  ASCENT_DIR=/path/to/spack-installed-ascent
$ ./main2d.gnu.ex inputs_2d

I also tried DEBUG=TRUE.

cyrush · 2024-04-08T18:40:36Z

@robertsawko do you have any custom classes like those mentioned in #2994?

If it's a C++ static init and finalize class issue, those are very, very hard to reason about. (The order in which things are deallocated is not guaranteed.)
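To illustrate the ordering problem in isolation: the order in which static objects in different translation units are initialized is unspecified, and destruction happens in the reverse order of construction, so one static object can end up using another that is already gone. One common mitigation is the construct-on-first-use idiom, sketched below. This is a generic illustration only. `Registry`, `registry()`, and `event_log()` are hypothetical names, and nothing here is taken from AMReX, Ascent, or VTK-m.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical event log. It is allocated once and intentionally never freed,
// so it stays valid even while other static objects are being destroyed.
std::vector<std::string>& event_log() {
    static auto* log = new std::vector<std::string>();
    return *log;
}

// Hypothetical stand-in for a library-owned global resource.
struct Registry {
    Registry() { event_log().push_back("Registry constructed"); }
    ~Registry() { event_log().push_back("Registry destroyed"); }
    int value = 42;
};

// Construct-on-first-use: the object is created the first time this function
// is called, not at some unspecified point during static initialization.
Registry& registry() {
    static Registry instance;
    return instance;
}

// Returns true if two calls yield the same, already-constructed object.
bool registry_is_singleton() {
    Registry& a = registry();
    Registry& b = registry();
    assert(a.value == 42);
    return &a == &b;
}
```

The idiom pins down when construction happens. Destruction of `instance` still takes place during process exit, which is why the log in this sketch is deliberately leaked instead of being a plain static object.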

I will look into vtkm::cont::RuntimeDeviceTracker to see if it could be subject to something like this.
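For context on the backtrace above: frame #2 points into libstdc++'s atexit_thread.cc, which appears to be the routine that runs destructors of thread_local objects, and frame #1 is a shared_ptr destructor invoked from it. The mechanism can be sketched in isolation as follows. This is a minimal illustration under stated assumptions: `Tracker` is a hypothetical stand-in, and nothing is assumed about how VTK-m actually stores its RuntimeDeviceTracker.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> constructed{0};
std::atomic<int> destroyed{0};

// Hypothetical stand-in for a per-thread object such as a device tracker.
struct Tracker {
    Tracker() { constructed.fetch_add(1); }
    ~Tracker() { destroyed.fetch_add(1); }
};

// The thread_local object is created on first use in each thread, and its
// destructor is registered to run when that thread exits.
void touch_tracker() {
    thread_local Tracker tracker;
    (void)tracker;
}

// Runs touch_tracker() on a worker thread and returns how many Tracker
// destructors have run once the worker has been joined.
int destroyed_after_worker_exit() {
    std::thread worker(touch_tracker);
    worker.join();  // the worker's thread_local destructors have run by now
    assert(constructed.load() >= 1);
    return destroyed.load();
}
```

On the main thread the same kind of destructor runs during exit(), after main() has returned, which would match the position of frames #2 to #4 in the backtrace and would explain why the crash shows up only after AMReX reports that it has finalized.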

robertsawko · 2024-04-12T09:41:54Z

I am really sorry for taking so long to respond. Our HPC system has serious I/O issues (which makes working on in situ even more relevant!), and currently it's just unusable. If it's still like this next week, I will reproduce the environment locally and retry.

cyrush mentioned this issue Apr 9, 2024

2024/04 VTK-m Questions Alpine-DAV/ascent#1270

Open
