MPC causes segfault on Frontier #4725
Comments
The NaN is in scalar.dat, but the NaN detector in the wavefunction components was not tripped. => There is most likely a problem with just the MPC computation.
The segfaults are quasi-reproducible when run with the same seed (single-node runs in all cases). The reproduction rate is better than 50%. Below, * indicates segfaults that appear in only one set of runs; all others reproduce. The behavior is likely non-deterministic, so any ported fix should be rerun a few times for verification. Original set:
Reruns:
Also, I observed no NaNs in scalar.dat for the reruns.
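Since the failure is intermittent, checking each rerun for NaNs by hand is error-prone. The following is a minimal sketch of an automated check, not part of QMCPACK or its tooling. It assumes that scalar.dat files are whitespace-separated text with header lines starting with `#`, and that their names match `*.scalar.dat`. The function names `find_nonfinite_rows` and `scan_run_directory` are hypothetical.

```python
import math
from pathlib import Path


def find_nonfinite_rows(path):
    """Return (line_number, column_index) pairs for non-finite entries.

    Assumption: the file is whitespace-separated text and header lines
    start with '#'. Line numbers are 1-based, column indices 0-based.
    """
    hits = []
    with open(path) as handle:
        for lineno, line in enumerate(handle, start=1):
            stripped = line.strip()
            if not stripped or stripped.startswith("#"):
                continue  # skip blank lines and header lines
            for col, token in enumerate(stripped.split()):
                try:
                    value = float(token)
                except ValueError:
                    # A token that is not a number at all is flagged too.
                    hits.append((lineno, col))
                    continue
                if not math.isfinite(value):  # catches NaN and +/-inf
                    hits.append((lineno, col))
    return hits


def scan_run_directory(run_dir, pattern="*.scalar.dat"):
    """Map each matching file under run_dir to its non-finite entries.

    Assumption: the glob pattern is a guess at the file naming; adjust it
    to the actual output names. Files with no problems are omitted.
    """
    report = {}
    for path in sorted(Path(run_dir).rglob(pattern)):
        hits = find_nonfinite_rows(path)
        if hits:
            report[str(path)] = hits
    return report
```

Calling `scan_run_directory(".")` from the top of a set of reruns returns an empty dictionary when no NaN or inf values are present, which makes it easy to compare several reruns of the same seed.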
Describe the bug
Use of MPC is unstable on Frontier (CPU code). A handful of FeCl2 runs have segfaulted, and one run has produced NaN.
To Reproduce
Build details:
Problem cases (segfault):
Problem case (NaN in scalar.dat):
FeCl2-tile-2-hyb-0-spo-0-est-0-walk-1680
Location on Frontier:
/lustre/orion/mat151/proj-shared/ecp_vdw_test_runs/frontier_files/test_runs_jk_cpu/runs_2023-09-11-09-15-23
To reproduce, copy the relevant files into a new directory and resubmit (sbatch qmc.sbatch.in).
Expected behavior
No segfaults or NaNs.