Using ILDG checkpointer causes a crash during write #423

Open
vmos1 opened this issue Feb 22, 2023 · 2 comments

vmos1 commented Feb 22, 2023

I'm testing an HMC workflow with the ILDG checkpointer.
The sample code can be accessed here.
The code runs well with the NERSC checkpointer, loaded as:
TheHMC.Resources.LoadNerscCheckpointer(CPparams);

It fails when using the ILDG checkpointer, loaded as:
TheHMC.Resources.LoadILDGCheckpointer(CPparams);
 
The code runs until the first checkpoint, then I get a 'core' file and the following errors:


hmc_SDM: 
/grid_prefix/include/Grid/parallelIO/IldgIO.h:616: void Grid::IldgWriter::writeLimeIldgLFN(std::string &): Assertion `err>=0' failed.
srun: error: tioga13: tasks 1-7: Aborted (core dumped) 
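
The assertion message above only says that `err>=0` was false; it does not show the value of `err` or which call produced it. As a rough sketch of how that information could be surfaced, a small checking helper can print the return code and call site before aborting. The names `formatLimeError`, `checkLime` and `CHECK_LIME` are hypothetical illustrations, not part of Grid or the lime library.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical helper (not part of Grid or lime): build a message recording
// which call failed, where it was made, and the return code it produced.
std::string formatLimeError(const char *call, int err, const char *file, int line) {
  return std::string(call) + " returned " + std::to_string(err) +
         " at " + file + ":" + std::to_string(line);
}

// Hypothetical stand-in for a bare `assert(err>=0)`: report first, then abort.
inline void checkLime(const char *call, int err, const char *file, int line) {
  if (err < 0) {
    std::fprintf(stderr, "%s\n", formatLimeError(call, err, file, line).c_str());
    std::abort();
  }
}

// Wraps an expression returning an int status; evaluates it exactly once.
#define CHECK_LIME(expr) checkLime(#expr, (expr), __FILE__, __LINE__)
```

Wrapping each status-returning I/O call in `CHECK_LIME(...)` would make every rank print the failing call and its return code to stderr, which is useful when a core file yields no backtrace.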

The last few lines of the output are:

Grid : Message : 267.892527 s : IOobject:  write 3328 bytes in 0.104757 s 0.0302971 MB/s 
Grid : Message : 267.892535 s : IOobject: endian and checksum overhead 0.000015 s
Grid : Message : 267.892537 s : RNG file checksum 4dc54934
Grid : Message : 267.892538 s : RNG file checksuma 8a569ac0
Grid : Message : 267.892539 s : RNG file checksumb 447f5161
Grid : Message : 267.892540 s : RNG state overhead 0.002102 s

 
I have replicated the error on the Crusher (ORNL) and Tioga (LLNL) AMD machines.

Building Grid:
To build Grid, I use the standard procedure with lime, documented here.

Owner

paboyle commented Sep 18, 2023

Can you please:
i) recompile with configure flags including --enable-debug;
ii) rerun the same volume on a single MPI rank, using a cold start if necessary;
iii) rerun it under gdb interactively. The abort should then be trapped, and you can type "backtrace"
to find the line of code and, hopefully, the problem. You can print variables in the local file with print if necessary.

Author

vmos1 commented Sep 29, 2023

Recompiled with --enable-debug.
Ran on a single MPI rank -> the code works fine. Repeating with 2 ranks causes the same failure as above.
The rng file is written, but the failure occurs while writing the lat file, which is much bigger.

Running gdb on the core dump doesn't yield anything:
"backtrace" gives "No stack".

Any idea why this happens only for ILDG (not the NERSC format), and only on multiple ranks?
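
Since the failure appears only with 2 or more ranks and the core file gives no stack, one common technique is to make each rank pause at startup so gdb can be attached to the live process. The sketch below shows the idea; `waitForDebugger`, `debuggerWaitFlag` and the `WAIT_FOR_GDB` environment variable are hypothetical names chosen for illustration and are not part of Grid.

```cpp
#include <cstdio>
#include <cstdlib>
#include <unistd.h>

// Cleared from the debugger (e.g. `set var debuggerWaitFlag = 0`) to resume.
volatile int debuggerWaitFlag = 1;

// Hypothetical helper: if the environment variable WAIT_FOR_GDB is set,
// print the host name and PID, then spin until a debugger clears the flag.
// Returns true if the wait path was taken, false if the variable was unset.
bool waitForDebugger() {
  if (std::getenv("WAIT_FOR_GDB") == nullptr) return false;
  char host[256] = {0};
  gethostname(host, sizeof(host) - 1);
  std::fprintf(stderr, "PID %d on %s waiting for debugger\n",
               static_cast<int>(getpid()), host);
  while (debuggerWaitFlag) {
    sleep(1);  // poll once per second until the flag is cleared
  }
  return true;
}
```

Assuming the helper is called near the start of `main` and the job is launched with `WAIT_FOR_GDB=1`, each rank prints its PID; attaching with `gdb -p <PID>` on the node, running `set var debuggerWaitFlag = 0` and then `continue` should leave gdb attached when the assertion fires, at which point `backtrace` has a live stack to show. Every rank waits, so each one has to be released this way.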
