Sudden jump in VMC and nan in DMC energies using Frontier #4903

Open
kayahans opened this issue Jan 18, 2024 · 13 comments

@kayahans
Contributor

kayahans commented Jan 18, 2024

Describe the bug
The VMC energies and variance suddenly jump for twist numbers 0 and 1. Although they appear to recover for both twists, twist number 1 later produces nan energies in the DMC calculation.

To Reproduce
Steps to reproduce the behavior:
QMCPACK 3.17.9 (Dec 22nd)
Frontier
Using the Frontier build script
All the input and smaller statistical output files are provided in the attachment
Wavefunctions are provided in
/lustre/orion/mat151/proj-shared/qmcpack_bug_issue_4903

Expected behavior
From Frontier
Local energy (screenshot)
Variance (screenshot)

In the figures it looks like there is only a jump in the VMC energies, but
grep nan *scalar.dat
shows persistent nan values in the dmc.g001.s002.scalar.dat file upon inspection.
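For example (plain grep options, independent of qmca), the affected files and the extent of the nan contamination can be checked with:

    grep -l nan *.scalar.dat              # list the scalar files that contain nan
    grep -c nan dmc.g001.s002.scalar.dat  # count the nan lines in the worst file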

From Cades:
Local energy (screenshot)
Variance (screenshot)

System:
Frontier

Additional context
input and statistical output files
From Frontier
dmc_WSe2_AAp_pbe_u_None_4x4x1_2x2x1_2500.tar.gz
From Cades
dmc_WSe2_AAp_pbe_u_None_4x4x1_2x2x1_2500_cades.tar.gz

@ye-luo
Contributor

ye-luo commented Jan 18, 2024

Could you rerun under exactly the same conditions and see if the issue is reproducible?

@kayahans
Contributor Author

I ran it twice and observed the jump in VMC both times. I didn't check for nan errors in the first try.

@kayahans
Contributor Author

Here are the results from the first run I made (the results reported at the top are from the second run):

qmca -q eV *.scalar.dat
LocalEnergy Variance ratio
dmc.g000 series 0 1031677669175257708749823602787387928526500921344.000000 +/- 1026345965942322773734163902878667705320520810496.000000 277413416700286380944563850635856933246045636680252969819277578653873886403926720519161636131019161600.000000 +/- 275979746034657924841166118984974651779648503949948056700927631438939421145829807712974513419170349056.000000 268895436034838101943021826030389515654447120450060288.0000
dmc.g001 series 0 -2759.381274 +/- 0.289404 22234.077176 +/- 20883.370062 8.0576
dmc.g002 series 0 -2759.132788 +/- 0.017212 33.140712 +/- 0.145202 0.0120
dmc.g003 series 0 -313831855792393308948292292026331824128.000000 +/- 311787326824377924783786256208643489792.000000 20377572244396539148925002298154777013379315760865839814824687690869529790243667968.000000 +/- 20244817917585162855725057061160757600720053693529940553593344515323351058201182208.000000 64931497132262932133405816933814858135109632.0000

(screenshot)

Comparing the first and second runs, different twists were affected, except for gamma, which seems to be problematic in both cases. The inputs and statistical outputs of the first run are attached here:

dmc_WSe2_AAp_pbe_u_None_4x4x1_2x2x1_2500_first.tar.gz

The first and second runs differ only in the "walkers_per_rank" parameter.
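For context, walkers_per_rank is set per driver block in the QMCPACK input; a minimal sketch of where it lives (the value 32 is a placeholder, not the value actually used in either run):

    <qmc method="vmc">
      <!-- the only setting that differed between the two runs; 32 is a placeholder value -->
      <parameter name="walkers_per_rank"> 32 </parameter>
      ...
    </qmc>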

@ye-luo
Contributor

ye-luo commented Jan 22, 2024

Could you rerun with export HSA_ENABLE_SDMA=0 in your job script? It works around a known AMD software bug.
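For reference, on Frontier the export goes before the launch line in the Slurm batch script; a minimal sketch (the account, node count, time, and paths are placeholders):

    #!/bin/bash
    #SBATCH -A mat151
    #SBATCH -N 2
    #SBATCH -t 02:00:00

    # disable the SDMA engines to work around the known AMD/ROCm transfer bug
    export HSA_ENABLE_SDMA=0

    srun -n 16 --gpus-per-node=8 /path/to/qmcpack dmc.in.xml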

@kayahans
Contributor Author

With HSA_ENABLE_SDMA=0 the situation seems improved, but not fully resolved. Now I only see the energy jump in VMC, with no nan values in DMC.
Run 1:

qmca -q eV *.scalar.dat -at

LocalEnergy Variance ratio
avg series 0 71988882972599952.000000 +/- 71494108154439304.000000 2033443914202484946718242814936265785344.000000 +/- 2019468189043949633742443321418877763584.000000 28246637956258387197952.0000
avg series 1 -2762.208943 +/- 0.349528 33.730507 +/- 0.127176 0.0122
avg series 2 -2762.457672 +/- 0.063145 33.267233 +/- 0.095556 0.0120
(screenshots)

qmca -q eV *.scalar.dat

LocalEnergy Variance ratio
dmc.g000 series 0 -2759.200685 +/- 0.015851 33.589793 +/- 0.386683 0.0122
dmc.g000 series 1 -2762.312811 +/- 0.309343 34.011915 +/- 0.422985 0.0123
dmc.g000 series 2 -2762.369345 +/- 0.081484 33.179224 +/- 0.347585 0.0120

dmc.g001 series 0 210207538279997248.000000 +/- 209153859773708928.000000 5934342188785026234929344173559143989248.000000 +/- 5904595925332901835902739230938844102656.000000 28230872390886664175616.0000
dmc.g001 series 1 -2762.129798 +/- 0.420648 34.077298 +/- 0.264550 0.0123
dmc.g001 series 2 -2762.497175 +/- 0.066004 33.121632 +/- 0.199665 0.0120

dmc.g002 series 0 -2759.158370 +/- 0.020109 33.546908 +/- 0.294709 0.0122
dmc.g002 series 1 -2762.184378 +/- 0.285604 33.127810 +/- 0.281526 0.0120
dmc.g002 series 2 -2762.498208 +/- 0.102641 33.159337 +/- 0.234883 0.0120

dmc.g003 series 0 -2759.098096 +/- 0.022484 33.131910 +/- 0.231165 0.0120
dmc.g003 series 1 -2762.208786 +/- 0.392482 33.634674 +/- 0.185123 0.0122
dmc.g003 series 2 -2762.459931 +/- 0.026254 33.521872 +/- 0.473860 0.0121

Run 2:

qmca -q eV *.scalar.dat -at

LocalEnergy Variance ratio
avg series 0 -2759.165978 +/- 0.015274 158.794921 +/- 124.896919 0.0576
avg series 1 -2762.284685 +/- 0.352293 33.699636 +/- 0.158610 0.0122
avg series 2 -2762.576489 +/- 0.036107 33.397349 +/- 0.196986 0.0121

(screenshots)

qmca -q eV *.scalar.dat

LocalEnergy Variance ratio
dmc.g000 series 0 -2759.225573 +/- 0.018721 32.917302 +/- 0.157844 0.0119
dmc.g000 series 1 -2762.497002 +/- 0.338240 33.687777 +/- 0.145556 0.0122
dmc.g000 series 2 -2762.647535 +/- 0.046788 33.400399 +/- 0.364039 0.0121

dmc.g001 series 0 -2759.127363 +/- 0.014157 33.528921 +/- 0.217131 0.0122
dmc.g001 series 1 -2762.123021 +/- 0.334773 33.710596 +/- 0.213986 0.0122
dmc.g001 series 2 -2762.494155 +/- 0.058186 33.305032 +/- 0.499683 0.0121

dmc.g002 series 0 -2759.163331 +/- 0.013737 33.054767 +/- 0.174415 0.0120
dmc.g002 series 1 -2762.142878 +/- 0.373252 33.205315 +/- 0.198137 0.0120
dmc.g002 series 2 -2762.580997 +/- 0.045814 33.453149 +/- 0.511991 0.0121

dmc.g003 series 0 -2759.147645 +/- 0.060131 535.335990 +/- 499.286335 0.1940
dmc.g003 series 1 -2762.375838 +/- 0.365324 34.056943 +/- 0.498526 0.0123
dmc.g003 series 2 -2762.583269 +/- 0.058106 33.385943 +/- 0.289469 0.0121

@ye-luo
Contributor

ye-luo commented Feb 9, 2024

It seems that you are using hybridrep + GPU, which is still under development. Could you run with gpu=no added to the sposet_builder line?
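For reference, that corresponds to something like the following in the input file (href and the other attributes are placeholders from a typical bspline setup; only hybridrep and gpu matter for this test):

    <sposet_builder type="bspline" href="wfn.h5" source="ion0"
                    meshfactor="1.0" precision="float"
                    hybridrep="yes" gpu="no">
      ...
    </sposet_builder>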

@prckent
Contributor

prckent commented Feb 12, 2024

@ye-luo Is hybridrep+GPU incomplete or known to be buggy or just not tested enough (etc.)? If it is known to be incomplete then it should be blocked off or have an unmissable warning printed.

@kayahans Have you been able to run this elsewhere (NERSC CPUs?)? It is more important that you can publish the science than spend any time chasing this.

@kayahans
Contributor Author

@prckent I ran these calculations on Cades. I have attached the input files I used and the trace data plots in the issue post at the top.

@kayahans
Contributor Author

kayahans commented Feb 12, 2024

It seems that you are using hybridrep + GPU, which is still under development. Could you run with gpu=no added to the sposet_builder line?

@ye-luo Should I run this on Frontier again?

@ye-luo
Contributor

ye-luo commented Feb 13, 2024

@kayahans

  1. Are the runs on Cades all good? If not, we probably need to first look into other reasons for the failure before touching GPUs.
  2. Regarding hybridrep on GPU, it should technically work: the code paths are routed through the single-walker API and the tests pass, but the performance is very poor. So it is not recommended on the GPU right now. If you have production needs on the GPU, it is recommended to just run the hybrid SPO on the CPU.
  3. It is hard to guess why the code is behaving strangely on Frontier. To distinguish an AMD software issue from bad code on our side, I would like to have runs on NVIDIA machines first.

@kayahans
Contributor Author

kayahans commented Feb 15, 2024

Thanks @ye-luo, yes, I had no such issues when running this or other bilayer materials on Cades, which is a CPU-only machine. I take it your suggestion is to run the same calculation on Polaris?

@ye-luo
Contributor

ye-luo commented Feb 15, 2024

Thanks @ye-luo, yes, I had no such issues when running this or other bilayer materials on Cades, which is a CPU-only machine. I take it your suggestion is to run the same calculation on Polaris?

My suggestion is to put hybridrep on the CPU even when you are using the GPU.

@kayahans
Contributor Author

@ye-luo Running with the hybrid rep on the CPU seems to solve the problem. I didn't see any spikes in the VMC energy with the hybrid rep on the CPU. Here are the VMC total energies from Cades and Frontier; they agree within the error bars:
Cades:

                            LocalEnergy               Variance           ratio
avg  series 0  -2759.145563 +/- 0.006540   33.248169 +/- 0.064987   0.0121

Frontier:

                            LocalEnergy               Variance           ratio
avg  series 0  -2759.149956 +/- 0.005542   33.390082 +/- 0.090175   0.0121

Frontier VMC trace:

(screenshots)

@jtkrogel added the bug label on May 17, 2024