
MPC causes segfault on Frontier #4725

Open
jtkrogel opened this issue Sep 11, 2023 · 2 comments

jtkrogel (Contributor) commented Sep 11, 2023

Describe the bug

Use of MPC is unstable on Frontier (CPU code). A handful of FeCl2 runs have segfaulted, and one run has produced a NaN.

To Reproduce

Build details:

  Git branch: develop
  Last git commit: 283f2438770bdfb592d161d287771764cbf6f96c
  Last git commit date: Sat Aug 26 09:36:21 2023 -0500
  Last git commit subject: Merge pull request #4715 from QMCPACK/prckent-patch-1

Currently Loaded Modules:
  1) craype-x86-trento                      13) darshan-runtime/3.4.0
  2) libfabric/1.15.2.0                     14) hsi/default
  3) craype-network-ofi                     15) DefApps/default
  4) perftools-base/22.12.0                 16) emacs/28.1
  5) xpmem/2.6.2-2.5_2.22__gd067c3f.shasta  17) cmake/3.23.2
  6) cray-pmi/6.1.8                         18) openblas/0.3.17
  7) cce/15.0.0                             19) cray-fftw/3.3.10.3
  8) craype/2.7.19                          20) hdf5/1.14.0
  9) cray-dsmml/0.2.2                       21) boost/1.79.0
 10) cray-mpich/8.1.23                      22) rocm/5.5.1
 11) cray-libsci/22.12.1.1                  23) ninja/1.10.2
 12) PrgEnv-cray/8.3.3

Executable:
/lustre/orion/world-shared/mat151/pk7/try_frontier/build_frontier_cpu_real_MP/bin/qmcpack

Problem cases (segfault):

FeCl2-tile-3-hyb-0-spo-0-est-0-walk-180/qmc.out:srun:  error: frontier04992: task 7: Segmentation fault (core dumped)
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-1024/qmc.out:srun: error: frontier08960: task 4: Segmentation fault (core dumped)
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-1680/qmc.out:srun: error: frontier10366: task 6: Segmentation fault (core dumped)
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-2400/qmc.out:srun: error: frontier00384: task 5: Segmentation fault (core dumped)
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-3360/qmc.out:srun: error: frontier08319: task 3: Segmentation fault (core dumped)
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-3840/qmc.out:srun: error: frontier00208: task 6: Segmentation fault (core dumped)
FeCl2-tile-4-hyb-0-spo-0-est-0-walk-720/qmc.out:srun:  error: frontier00201: task 0: Segmentation fault (core dumped)

Problem case (NaN in scalar.dat):
FeCl2-tile-2-hyb-0-spo-0-est-0-walk-1680

Location on Frontier:
/lustre/orion/mat151/proj-shared/ecp_vdw_test_runs/frontier_files/test_runs_jk_cpu/runs_2023-09-11-09-15-23

To reproduce, copy the relevant files in a new directory and resubmit (sbatch qmc.sbatch.in).

Expected behavior
No segfaults or NaNs.

@jtkrogel jtkrogel added the bug label Sep 11, 2023
prckent (Contributor) commented Sep 11, 2023

The NaN is in the scalar.dat file, but the NaN detector in the wavefunction components was not tripped. => The problem is most likely confined to the MPC computation.

runs_2023-09-11-09-15-23]$ grep -n -i NaN */*.scalar.dat
FeCl2-tile-2-hyb-0-spo-0-est-0-walk-1680/vmc.s000.scalar.dat:3:         1   -1.2273884599e+03    1.5065084464e+06   -1.8482543823e+03    6.2086592258e+02   -1.2991870771e+04    2.0324763583e+02    5.9220107535e+03    5.0183579990e+03                -nan    4.0320000000e+04    8.9543626972e+01    6.6160342262e-01
FeCl2-tile-2-hyb-0-spo-0-est-0-walk-1680/vmc.s000.scalar.dat:4:         2   -1.2275045133e+03    1.5067927820e+06   -1.8592057578e+03    6.3170124457e+02   -1.2992550874e+04    2.0367290127e+02    5.9113142159e+03    5.0183579990e+03                -nan    4.0320000000e+04    8.9694526033e+01    6.5901697875e-01

jtkrogel (Contributor, Author) commented Sep 12, 2023

The segfaults are quasi-reproducible when run with the same seed (single node runs in all cases). The reproduction rate is better than 50%.

Below, * marks segfaults that appear in only one set of runs; all others reproduce across both sets. The behavior is likely non-deterministic, so any candidate fix should be rerun a few times for verification.
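Checking a batch of reruns for failures can be sketched as follows. This is a minimal helper, not part of the reported workflow: `scan_runs` is a hypothetical function name, and it assumes the `<run_dir>/qmc.out` and `<run_dir>/*.scalar.dat` layout of the FeCl2 runs above.

```shell
#!/bin/sh
# scan_runs: hypothetical helper to flag run directories that segfaulted
# or produced NaNs. Assumes each run directory contains qmc.out (srun log)
# and one or more *.scalar.dat files, as in the FeCl2-tile-* runs above.
scan_runs() {
  for d in "$@"; do
    [ -d "$d" ] || continue
    # srun reports segfaults in the captured qmc.out
    grep -q "Segmentation fault" "$d/qmc.out" 2>/dev/null && echo "$d: segfault"
    # NaNs show up as "-nan" columns in the scalar data
    grep -qi "nan" "$d"/*.scalar.dat 2>/dev/null && echo "$d: NaN in scalar.dat"
  done
  return 0
}

# e.g., from inside a runs_* directory:
#   scan_runs FeCl2-tile-*
```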

Original set:

runs_2023-09-11-09-15-23
   FeCl2-tile-3-hyb-0-spo-0-est-0-walk-180 
   FeCl2-tile-3-hyb-0-spo-0-est-0-walk-1024
   FeCl2-tile-3-hyb-0-spo-0-est-0-walk-1680
  *FeCl2-tile-3-hyb-0-spo-0-est-0-walk-2400
   FeCl2-tile-3-hyb-0-spo-0-est-0-walk-3360
   FeCl2-tile-3-hyb-0-spo-0-est-0-walk-3840
   FeCl2-tile-4-hyb-0-spo-0-est-0-walk-720 

Reruns:

runs_2023-09-11-12-31-45
   FeCl2-tile-3-hyb-0-spo-0-est-0-walk-180 
   FeCl2-tile-3-hyb-0-spo-0-est-0-walk-1024
   FeCl2-tile-3-hyb-0-spo-0-est-0-walk-1680
  *FeCl2-tile-3-hyb-0-spo-0-est-0-walk-2880
   FeCl2-tile-3-hyb-0-spo-0-est-0-walk-3360
   FeCl2-tile-3-hyb-0-spo-0-est-0-walk-3840
  *FeCl2-tile-4-hyb-0-spo-0-est-0-walk-300 
  *FeCl2-tile-4-hyb-0-spo-0-est-0-walk-512 
   FeCl2-tile-4-hyb-0-spo-0-est-0-walk-720 

Also, I observed no NaNs in scalar.dat for the reruns.
