Ascent appears to cause a segfault #3873

Open
robertsawko opened this issue Apr 5, 2024 · 7 comments

@robertsawko

Hello,

Two colleagues and I have been using AMReX's built-in Ascent integration. We religiously followed the three Blueprint tutorials and implemented a function that uses our finest level to produce a mesh blueprint with SingleLevelToBlueprint and passes it to Ascent along with some actions to execute.

The simulations run okay, generate images as expected, but end up with a segfault:

AMReX (v2024-22-ga611acdeafed) finalized
[sqg1cintr16:55480:0:55480] Caught signal 11 (Segmentation fault: <unknown si_code> at address 0x7ffe7df9bcb0)
==== backtrace ====
 0 0x0000000000036280 killpg()  ???:0
===================
Segmentation fault (core dumped)

This is strange to me in many ways. Firstly, it looks like AMReX actually finalizes fine. I am used to segfaults being quite fatal to running programs.

The problem happens both in parallel and in serial. I cannot reproduce it with the heat equation tutorial, but one of my colleagues reports that he saw something like this in the 2D heat equation too.

The code itself is not even very interesting:

// Allocate a multifab in amrex::Vector<amrex::MultiFab>
// call SingleLevelToBlueprint with the final level and store result in bp_mesh

ascent::Ascent ascent;
conduit::Node open_opts;
open_opts["mpi_comm"] = MPI_Comm_c2f(amrex::ParallelDescriptor::Communicator());
ascent.open(open_opts);

ascent.publish(bp_mesh);

// set up some actions and scenes
ascent.execute(actions);

ascent.close();

I've run a backtrace on the core dump, but I am still none the wiser:

(gdb) bt
#0  0x0000000001972bd0 in ?? ()
#1  0x00002b9f1020d27e in std::shared_ptr<vtkm::cont::RuntimeDeviceTracker>::~shared_ptr() ()
   from /lustre/scafellpike/local/HT05466/xxg04/shared/extern/ascent/build/install/vtk-m-v2.1.0/lib/libvtkm_cont-2.1.so.2.1
#2  0x00002b9f055caec6 in (anonymous namespace)::run (p=<optimized out>)
    at ../../../../libstdc++-v3/libsupc++/atexit_thread.cc:75
#3  0x00002b9f05e55b69 in __run_exit_handlers () from /usr/lib64/libc.so.6
#4  0x00002b9f05e55bb7 in exit () from /usr/lib64/libc.so.6
#5  0x00002b9f05e3e3dc in __libc_start_main () from /usr/lib64/libc.so.6
#6  0x000000000041e826 in _start ()

Could you please give us any suggestions as to what might be going wrong?

@BenWibking
Contributor

Is this running on GPU?

Some Ascent actions (data binning, in particular) assume that GPU memory is accessible from the CPU (as is the case on Summit, Frontier, and other unified memory systems).

See: Alpine-DAV/ascent#1122.

@WeiqunZhang
Member

@cyrush Cyrus, do you have any suggestions?

@robertsawko
Author

Thanks for the quick replies.

> Is this running on GPU?

No, this is an oldish, all-CPU system.

The really surprising thing is that everything else seems to be working just fine. I am still trying to reproduce the error on the AMReX tutorial. My colleague got this screenshot on the heat equation tutorial:
[Image: Screenshot from 2024-04-06 09-52-59]

@BenWibking
Contributor

I think I misunderstood your original issue.

Your issue actually seems very similar to a different issue I reported here: #2994. I never fully understood why that happened, but it was somehow (?) caused by an unrelated global variable.

@WeiqunZhang
Member

I cannot reproduce the issue with the heat equation test. This is what I did.

$ spack install ascent
$ cd amrex-tutorials/ExampleCodes/Blueprint/HeatEquation_EX1_C/Exec
$ make -j DEBUG=FALSE USE_CONDUIT=TRUE USE_ASCENT=TRUE CONDUIT_DIR=/path/to/spack-installed-conduit ASCENT_DIR=/path/to/spack-installed-ascent
$ ./main2d.gnu.ex inputs_2d

I also tried DEBUG=TRUE.

@cyrush
Contributor

cyrush commented Apr 8, 2024

@robertsawko do you have any custom classes like those mentioned in #2994?

If it's a C++ static initialization and finalization issue, those are very hard to reason about. (The order in which static objects in different translation units are destroyed is not guaranteed.)

I will look into vtkm::cont::RuntimeDeviceTracker to see if it could be subject to something like this.
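
A note on the backtrace in the original report: frame #2 is in libstdc++'s atexit_thread.cc, which appears to be the code that runs destructors of thread_local objects. For the main thread that happens inside exit(), after main has returned, which would be consistent with the segfault showing up only after AMReX reports that it finalized. The sketch below is a minimal illustration of when thread_local destructors run. It is not the VTK-m code and not a reproduction of this crash; Tracker, thread_tracker, record, and run_worker are hypothetical names made up for the example.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Shared event log (hypothetical, for illustration only).
std::mutex g_log_mutex;
std::vector<std::string> g_log;

void record(const std::string& event) {
    std::lock_guard<std::mutex> lock(g_log_mutex);
    g_log.push_back(event);
}

// Stand-in for a per-thread resource; unrelated to the real
// vtkm::cont::RuntimeDeviceTracker.
struct Tracker {
    Tracker() { record("tracker constructed"); }
    ~Tracker() { record("tracker destroyed"); }
    void touch() {}
};

Tracker& thread_tracker() {
    // Constructed on first use in each thread, destroyed when that thread exits.
    thread_local Tracker tracker;
    return tracker;
}

// Runs one worker thread and returns the events in the order they occurred.
std::vector<std::string> run_worker() {
    {
        std::lock_guard<std::mutex> lock(g_log_mutex);
        g_log.clear();
    }
    std::thread worker([] {
        thread_tracker().touch();          // first use constructs the Tracker
        record("worker body finished");
    });
    worker.join();                         // worker's thread_local destructors have run by now
    assert(!worker.joinable());
    std::lock_guard<std::mutex> lock(g_log_mutex);
    return g_log;
}
```

Calling run_worker() returns three events in this order: "tracker constructed", "worker body finished", "tracker destroyed". The destructor runs after the thread's own code has finished. On the main thread the equivalent point is during exit(), so anything such a destructor touches (another library's state, for instance) must still be valid that late.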

@robertsawko
Author

I am really sorry for taking so long to respond. Our HPC system has serious I/O issues (which is why working on in situ is even more relevant!), and currently it's just unusable. If it's still like this next week, I will reproduce the environment locally and retry.
