Bugs: the results of different parallel schemes vary greatly for LCAO calculations #4122

Open
16 tasks
WHUweiqingzhou opened this issue May 8, 2024 · 7 comments
Labels
High Priority Bugs which needs to be fixed ASAP.

Comments

@WHUweiqingzhou
Collaborator

Describe the bug

While testing issue #4058, I found that different parallel settings give completely different results for the same INPUT:

OMP_NUM_THREADS=1 mpirun -np 16 abacus | tee out.log
OMP_NUM_THREADS=2 mpirun -np 16 abacus | tee out.log
OMP_NUM_THREADS=2 mpirun -np 8 abacus | tee out.log
OMP_NUM_THREADS=4 mpirun -np 4 abacus | tee out.log
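
Note that all four commands above tee into the same out.log, so each run overwrites the previous log. A minimal sketch of one way to keep the runs apart is given here; only `OMP_NUM_THREADS`, `mpirun -np` and the `abacus` executable come from the commands above, while the helper `build_command` and the log-file naming scheme are assumptions made for illustration.

```python
# (OMP threads, MPI ranks) pairs, copied from the four commands in the report.
SETTINGS = [(1, 16), (2, 16), (2, 8), (4, 4)]


def build_command(omp_threads, mpi_ranks):
    """Return one shell command line with a log file unique to the setting.

    Hypothetical helper: the log name pattern is an assumption, not part of
    the original report.
    """
    log_name = f"out_omp{omp_threads}_np{mpi_ranks}.log"
    return (
        f"OMP_NUM_THREADS={omp_threads} mpirun -np {mpi_ranks} abacus "
        f"| tee {log_name}"
    )


commands = [build_command(threads, ranks) for threads, ranks in SETTINGS]
for line in commands:
    print(line)
```

The printed lines can be pasted into a shell one at a time, so the logs of all four settings remain available for comparison afterwards.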

image

See more in the link.

Expected behavior

No response

To Reproduce

No response

Environment

No response

Additional Context

No response

Task list for Issue attackers (only for developers)

  • Verify the issue is not a duplicate.
  • Describe the bug.
  • Steps to reproduce.
  • Expected behavior.
  • Error message.
  • Environment details.
  • Additional context.
  • Assign a priority level (low, medium, high, urgent).
  • Assign the issue to a team member.
  • Label the issue with relevant tags.
  • Identify possible related issues.
  • Create a unit test or automated test to reproduce the bug (if applicable).
  • Fix the bug.
  • Test the fix.
  • Update documentation (if necessary).
  • Close the issue and inform the reporter (if applicable).
@WHUweiqingzhou WHUweiqingzhou added the High Priority Bugs which needs to be fixed ASAP. label May 8, 2024
@WHUweiqingzhou
Collaborator Author

WHUweiqingzhou commented May 9, 2024

I also ran tests with the GNU image, @dyzheng. The calculations are still unstable, but less so than with the Intel image:
image

For an unconverged INPUT, however, the calculations are even more unstable:
image

See more in link.

@WHUweiqingzhou WHUweiqingzhou self-assigned this May 9, 2024
@WHUweiqingzhou
Collaborator Author

WHUweiqingzhou commented May 9, 2024

Results for different versions (see link):

For v3.3.2, the results of STRU1 and STRU2 are different:

image

For v3.4.0, the results of STRU1 and STRU2 with different MPI settings are almost the same:

image

For v3.5.0, the results of STRU1 and STRU2 with different MPI settings differ:

image

For v3.6.0, the result is the same as for v3.5.0:

image

It looks like v3.4.0 behaves well, so something must have changed between v3.4.0 and v3.5.0.

@WHUweiqingzhou
Collaborator Author

WHUweiqingzhou commented May 11, 2024

I picked several commits to test; see the link.

For 38766b4, 2023/9/28:
image

For 2ffa3d4, 2023/10/9 (it looks like drho changes starting with this commit):
image

For 77f178d, 2023/10/26:
image

For 57c903a, 2023/11/03:
image

For fd76546, 2023/11/23:
image

@Qianruipku, could you have a look?

@WHUweiqingzhou
Collaborator Author

WHUweiqingzhou commented May 13, 2024

I tried commit a5abaea, which is the one just before 2ffa3d4:
image

I can confirm that the change happens at 2ffa3d4; see link.
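
The commit search in the last few comments is effectively a manual bisection, which `git bisect` can automate. A minimal sketch of the underlying binary search follows. The ordering of the commits is taken from the dates quoted in this thread, except that the position of a5abaea after 38766b4 is an assumption (it is only known to precede 2ffa3d4). The function `first_changed` and the set `changed` are hypothetical stand-ins for "build this commit, run the case, and inspect drho".

```python
def first_changed(commits, is_changed):
    """Return the first commit for which is_changed(commit) is True.

    Assumes the commits are in chronological order, the first one shows the
    old behaviour, the last one shows the new behaviour, and the behaviour
    flips exactly once in between.
    """
    lo, hi = 0, len(commits) - 1  # commits[lo] is old, commits[hi] is new
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_changed(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Chronological order (see the assumption about a5abaea in the text above).
commits = ["38766b4", "a5abaea", "2ffa3d4", "77f178d", "57c903a", "fd76546"]

# Assumed outcome of each test run: drho differs from 2ffa3d4 onwards.
changed = {"2ffa3d4", "77f178d", "57c903a", "fd76546"}

print(first_changed(commits, lambda commit: commit in changed))
```

With six candidate commits the search needs only two test runs instead of one per commit, which matters when a single SCF run takes many minutes.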

@WHUweiqingzhou
Collaborator Author

WHUweiqingzhou commented May 14, 2024

I tried mixing_type = pulay and mixing_ndim = 21 at a5abaea and got the result below. It looks like the old pulay (now called broyden) is not stable in this case?
image

link

@WHUweiqingzhou
Collaborator Author

WHUweiqingzhou commented May 15, 2024

@Qianruipku
I tried different settings, mixing_gg0=0 and scf_thr_type=1, at 2ffa3d4 and found that the result is the same as the Broyden result at a5abaea (see the link).
For a5abaea:

START CHARGE      : atomic
 DONE(1.32678    SEC) : INIT SCF
 ITER   TMAG      AMAG      ETOT(eV)       EDIFF(eV)      DRHO       TIME(s)    
 GE1    3.13e+01  3.21e+01  -2.012837e+05  0.000000e+00   4.623e-02  2.619e+01  
 GE2    3.62e+01  3.73e+01  -2.012882e+05  -4.475781e+00  1.512e-02  2.198e+01  
 GE3    3.47e+01  3.63e+01  -2.012885e+05  -3.216326e-01  1.092e-02  2.198e+01  
 GE4    3.46e+01  3.70e+01  -2.012879e+05  5.839711e-01   1.606e-02  2.201e+01  
 GE5    3.44e+01  3.67e+01  -2.012886e+05  -6.875509e-01  2.846e-03  2.197e+01  
 GE6    3.58e+01  3.81e+01  -2.012850e+05  3.610166e+00   3.639e-02  2.200e+01  
 GE7    3.43e+01  3.68e+01  -2.012886e+05  -3.587337e+00  4.259e-03  2.197e+01  
 GE8    3.42e+01  3.68e+01  -2.012886e+05  -4.320307e-02  1.462e-03  2.201e+01  
 GE9    3.43e+01  3.68e+01  -2.012886e+05  5.644168e-02   4.966e-03  2.200e+01  
 GE10   3.42e+01  3.68e+01  -2.012886e+05  -4.094097e-02  3.658e-03  2.201e+01  
 GE11   3.41e+01  3.69e+01  -2.012887e+05  -1.928839e-02  1.539e-03  2.202e+01  
 GE12   3.41e+01  3.69e+01  -2.012887e+05  -2.543738e-03  1.574e-03  2.202e+01  
 GE13   3.41e+01  3.69e+01  -2.012887e+05  -6.717234e-03  4.667e-04  2.203e+01  
 GE14   3.41e+01  3.69e+01  -2.012887e+05  2.690787e-03   1.217e-03  2.203e+01  
 GE15   3.41e+01  3.69e+01  -2.012887e+05  -3.728993e-03  4.753e-04  2.204e+01  
 GE16   3.41e+01  3.69e+01  -2.012887e+05  -6.213324e-04  3.090e-04  2.205e+01  
 GE17   3.41e+01  3.69e+01  -2.012887e+05  1.019319e-03   6.257e-04  2.205e+01  
 GE18   3.41e+01  3.69e+01  -2.012887e+05  -1.727669e-03  3.054e-04  2.206e+01  
 GE19   3.41e+01  3.69e+01  -2.012887e+05  -2.660938e-04  1.692e-04  2.212e+01  
 GE20   3.41e+01  3.69e+01  -2.012887e+05  -4.791429e-05  1.023e-04  2.209e+01  
 GE21   3.41e+01  3.69e+01  -2.012887e+05  -2.845928e-05  1.066e-04  2.212e+01  
 GE22   3.41e+01  3.69e+01  -2.012887e+05  -4.190659e-06  7.938e-05  2.217e+01  
 GE23   3.41e+01  3.69e+01  -2.012887e+05  -1.090570e-05  5.489e-05  2.213e+01  
 GE24   3.41e+01  3.69e+01  -2.012887e+05  7.023658e-08   6.898e-05  2.212e+01  
 GE25   3.41e+01  3.68e+01  -2.012887e+05  -1.937866e-06  5.804e-05  2.213e+01  
 GE26   3.41e+01  3.68e+01  -2.012887e+05  -1.122038e-05  2.331e-05  2.214e+01  
 GE27   3.41e+01  3.68e+01  -2.012887e+05  -7.857934e-07  2.666e-05  2.216e+01  
 GE28   3.41e+01  3.68e+01  -2.012887e+05  2.993593e-07   2.932e-05  2.215e+01  
 GE29   3.41e+01  3.68e+01  -2.012887e+05  -2.109869e-06  1.792e-05  2.213e+01  
 GE30   3.41e+01  3.68e+01  -2.012887e+05  3.184652e-07   2.027e-05  2.217e+01  
 GE31   3.41e+01  3.68e+01  -2.012887e+05  1.038596e-05   6.569e-05  2.214e+01  
 GE32   3.41e+01  3.68e+01  -2.012887e+05  -1.034819e-05  2.186e-05  2.214e+01  
 GE33   3.41e+01  3.68e+01  -2.012887e+05  4.226644e-06   4.674e-05  2.217e+01  
 GE34   3.41e+01  3.68e+01  -2.012887e+05  -1.963234e-06  3.550e-05  2.217e+01  
 GE35   3.41e+01  3.68e+01  -2.012887e+05  -2.668124e-06  2.238e-05  2.217e+01  
 GE36   3.41e+01  3.68e+01  -2.012887e+05  -7.664895e-07  1.254e-05  2.217e+01  
 GE37   3.41e+01  3.68e+01  -2.012887e+05  2.685720e-07   1.899e-05  2.217e+01  
 GE38   3.41e+01  3.68e+01  -2.012887e+05  5.085099e-07   2.371e-05  2.215e+01  
 GE39   3.41e+01  3.68e+01  -2.012887e+05  -1.756162e-07  2.297e-05  2.217e+01  
 GE40   3.41e+01  3.68e+01  -2.012887e+05  -1.341152e-06  1.120e-05  2.217e+01  
 GE41   3.41e+01  3.68e+01  -2.012887e+05  4.999221e-09   7.053e-06  2.215e+01  
 GE42   3.41e+01  3.68e+01  -2.012887e+05  6.138648e-07   1.840e-05  2.215e+01  
 GE43   3.41e+01  3.68e+01  -2.012887e+05  -8.628854e-07  9.731e-06  2.217e+01  
 GE44   3.41e+01  3.68e+01  -2.012887e+05  -1.153533e-07  6.617e-06  2.218e+01  
 GE45   3.41e+01  3.68e+01  -2.012887e+05  -4.853204e-08  5.367e-06  2.218e+01  
 GE46   3.41e+01  3.68e+01  -2.012887e+05  -2.341219e-08  5.924e-06  2.220e+01  

For 2ffa3d4:

ITER   TMAG      AMAG      ETOT(eV)       EDIFF(eV)      DRHO       TIME(s)    
 GE1    3.13e+01  3.21e+01  -2.012837e+05  0.000000e+00   2.314e+00  2.637e+01  
 GE2    3.62e+01  3.73e+01  -2.012882e+05  -4.475781e+00  2.857e-01  2.267e+01  
 GE3    3.47e+01  3.63e+01  -2.012885e+05  -3.216326e-01  8.493e-02  2.270e+01  
 GE4    3.46e+01  3.70e+01  -2.012879e+05  5.839711e-01   5.170e+00  2.271e+01  
 GE5    3.44e+01  3.67e+01  -2.012886e+05  -6.873429e-01  1.088e-02  2.270e+01  
 GE6    3.58e+01  3.81e+01  -2.012850e+05  3.617410e+00   3.629e+02  2.273e+01  
 GE7    3.43e+01  3.68e+01  -2.012886e+05  -3.594388e+00  1.362e+00  2.271e+01  
 GE8    3.42e+01  3.68e+01  -2.012886e+05  -4.305293e-02  2.881e-02  2.273e+01  
 GE9    3.42e+01  3.68e+01  -2.012886e+05  3.112351e-02   3.512e+00  2.277e+01  
 GE10   3.42e+01  3.68e+01  -2.012886e+05  -1.827247e-02  4.005e-01  2.271e+01  
 GE11   3.41e+01  3.69e+01  -2.012887e+05  -1.670464e-02  1.139e-01  2.261e+01  
 GE12   3.41e+01  3.69e+01  -2.012887e+05  -2.902616e-03  1.525e-01  2.242e+01  
 GE13   3.41e+01  3.69e+01  -2.012887e+05  -6.588764e-03  1.453e-03  2.240e+01  
 GE14   3.41e+01  3.69e+01  -2.012887e+05  3.300182e-04   1.390e-02  2.240e+01  
 GE15   3.41e+01  3.69e+01  -2.012887e+05  4.628865e-03   8.823e-02  2.238e+01  
 GE16   3.41e+01  3.69e+01  -2.012887e+05  -6.824514e-03  5.390e-04  2.241e+01  
 GE17   3.41e+01  3.69e+01  -2.012887e+05  4.037616e-04   3.347e-03  2.226e+01  
 GE18   3.41e+01  3.69e+01  -2.012887e+05  -1.141057e-03  1.147e-03  2.222e+01  
 GE19   3.41e+01  3.69e+01  -2.012887e+05  -2.858993e-04  6.974e-05  2.222e+01  
 GE20   3.41e+01  3.69e+01  -2.012887e+05  -4.719985e-05  2.939e-05  2.225e+01  
 GE21   3.41e+01  3.69e+01  -2.012887e+05  -2.902679e-05  4.334e-05  2.225e+01  
 GE22   3.41e+01  3.69e+01  -2.012887e+05  -3.342697e-06  3.658e-05  2.226e+01  
 GE23   3.41e+01  3.69e+01  -2.012887e+05  -1.117724e-05  1.266e-05  2.224e+01  
 GE24   3.41e+01  3.69e+01  -2.012887e+05  -1.517585e-07  4.617e-05  2.225e+01  
 GE25   3.41e+01  3.68e+01  -2.012887e+05  -6.274518e-07  6.574e-05  2.228e+01  
 GE26   3.41e+01  3.68e+01  -2.012887e+05  -1.256247e-05  6.464e-06  2.228e+01  
 GE27   3.41e+01  3.68e+01  -2.012887e+05  -1.080055e-06  7.928e-06  2.229e+01  
 GE28   3.41e+01  3.68e+01  -2.012887e+05  2.439966e-07   2.441e-05  2.229e+01  
 GE29   3.41e+01  3.68e+01  -2.012887e+05  -1.845282e-06  1.931e-05  2.228e+01  
 GE30   3.41e+01  3.68e+01  -2.012887e+05  3.163369e-07   1.136e-05  2.228e+01  
 GE31   3.41e+01  3.68e+01  -2.012887e+05  4.611435e-06   5.006e-04  2.226e+01  
 GE32   3.41e+01  3.68e+01  -2.012887e+05  -4.753171e-06  7.248e-05  2.231e+01  
 GE33   3.41e+01  3.68e+01  -2.012887e+05  3.359972e-06   8.001e-05  2.231e+01  
 GE34   3.41e+01  3.68e+01  -2.012887e+05  -2.829436e-06  4.059e-05  2.228e+01  
 GE35   3.41e+01  3.68e+01  -2.012887e+05  -6.033466e-07  2.800e-05  2.229e+01  
 GE36   3.41e+01  3.68e+01  -2.012887e+05  -1.081985e-06  3.135e-06  2.230e+01  
 GE37   3.41e+01  3.68e+01  -2.012887e+05  8.878815e-07   1.789e-05  2.227e+01  
 GE38   3.41e+01  3.68e+01  -2.012887e+05  6.261401e-08   2.357e-05  2.226e+01  
 GE39   3.41e+01  3.68e+01  -2.012887e+05  -5.403119e-07  1.372e-05  2.225e+01  
 GE40   3.41e+01  3.68e+01  -2.012887e+05  -8.631824e-07  2.357e-06  2.227e+01  
 GE41   3.41e+01  3.68e+01  -2.012887e+05  5.617541e-06   4.904e-07  2.251e+01
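
In the two tables above, ETOT and EDIFF are identical for GE1 to GE4, while DRHO already differs at GE1. A minimal sketch for checking this automatically is given below, assuming the seven-column row layout shown in the logs; the function `parse_scf` is a hypothetical helper, and the embedded rows are the first five iterations copied from each table.

```python
def parse_scf(text):
    """Parse 'GE<n> TMAG AMAG ETOT EDIFF DRHO TIME' rows, keyed by iteration."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        # A data row has 7 columns and starts with GE<iteration number>;
        # the header line starts with ITER and is skipped.
        if len(parts) == 7 and parts[0][:2] == "GE" and parts[0][2:].isdigit():
            iteration = int(parts[0][2:])
            _tmag, _amag, etot, ediff, drho, _time = map(float, parts[1:])
            rows[iteration] = {"etot": etot, "ediff": ediff, "drho": drho}
    return rows


LOG_A5ABAEA = """
 ITER   TMAG      AMAG      ETOT(eV)       EDIFF(eV)      DRHO       TIME(s)
 GE1    3.13e+01  3.21e+01  -2.012837e+05  0.000000e+00   4.623e-02  2.619e+01
 GE2    3.62e+01  3.73e+01  -2.012882e+05  -4.475781e+00  1.512e-02  2.198e+01
 GE3    3.47e+01  3.63e+01  -2.012885e+05  -3.216326e-01  1.092e-02  2.198e+01
 GE4    3.46e+01  3.70e+01  -2.012879e+05  5.839711e-01   1.606e-02  2.201e+01
 GE5    3.44e+01  3.67e+01  -2.012886e+05  -6.875509e-01  2.846e-03  2.197e+01
"""

LOG_2FFA3D4 = """
 ITER   TMAG      AMAG      ETOT(eV)       EDIFF(eV)      DRHO       TIME(s)
 GE1    3.13e+01  3.21e+01  -2.012837e+05  0.000000e+00   2.314e+00  2.637e+01
 GE2    3.62e+01  3.73e+01  -2.012882e+05  -4.475781e+00  2.857e-01  2.267e+01
 GE3    3.47e+01  3.63e+01  -2.012885e+05  -3.216326e-01  8.493e-02  2.270e+01
 GE4    3.46e+01  3.70e+01  -2.012879e+05  5.839711e-01   5.170e+00  2.271e+01
 GE5    3.44e+01  3.67e+01  -2.012886e+05  -6.873429e-01  1.088e-02  2.270e+01
"""

old, new = parse_scf(LOG_A5ABAEA), parse_scf(LOG_2FFA3D4)
for it in sorted(old.keys() & new.keys()):
    same_ediff = old[it]["ediff"] == new[it]["ediff"]
    ratio = new[it]["drho"] / old[it]["drho"]
    print(f"GE{it}: EDIFF identical: {same_ediff}, DRHO ratio: {ratio:.1f}")
```

Comparing EDIFF rather than the printed ETOT is deliberate: ETOT is shown with only seven significant digits, so small differences such as the one at GE5 are visible in EDIFF first.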

@jinzx10
Collaborator

jinzx10 commented May 23, 2024

I've got two questions:

  1. It was shown in "Unstable LCAO calculation of 004_Li128C75H100O75" (#2997) that, even if the parallelization scheme is the same, an LCAO calculation may still be unstable for some systems. Are the calculations in this issue stable from run to run? [We conjectured that #2997 might result from a nearly singular overlap matrix, but so far this is not confirmed and we do not have a solution in the near term.]
  2. If the calculations in this issue are stable on their own, is it possible to narrow the problem down further to MPI or OpenMP (or both)?
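
One way to organise the two checks above is sketched here, under stated assumptions: each setting is run several times, the final ETOT of every run is collected by hand, and a tolerance `tol` decides what counts as a difference. The functions `spread`, `varies_along` and `diagnose` are hypothetical helpers, and the energies in the example are made-up numbers, not results from this issue.

```python
def spread(values):
    """Largest minus smallest value."""
    return max(values) - min(values)


def varies_along(means, axis, tol):
    """True if the mean energy changes along one axis with the other fixed.

    axis=0 varies the MPI rank count at a fixed thread count;
    axis=1 varies the OpenMP thread count at a fixed rank count.
    """
    groups = {}
    for key, energy in means.items():
        groups.setdefault(key[1 - axis], []).append(energy)
    return any(spread(group) > tol for group in groups.values())


def diagnose(energies, tol):
    """energies maps (mpi_ranks, omp_threads, repeat) -> final ETOT in eV."""
    by_setting = {}
    for (ranks, threads, _repeat), energy in energies.items():
        by_setting.setdefault((ranks, threads), []).append(energy)
    means = {key: sum(vals) / len(vals) for key, vals in by_setting.items()}
    return {
        # Question 1: do identical settings reproduce the same energy?
        "run_to_run": any(spread(vals) > tol for vals in by_setting.values()),
        # Question 2: which kind of parallelism changes the answer?
        "mpi": varies_along(means, axis=0, tol=tol),
        "openmp": varies_along(means, axis=1, tol=tol),
    }


# Made-up energies: reproducible, sensitive to MPI ranks, not to threads.
example = {
    (4, 1, 0): -10.0, (4, 1, 1): -10.0,
    (8, 1, 0): -10.5, (8, 1, 1): -10.5,
    (4, 2, 0): -10.0, (4, 2, 1): -10.0,
    (8, 2, 0): -10.5, (8, 2, 1): -10.5,
}
result = diagnose(example, tol=1e-6)
print(result)
```

If `run_to_run` comes out true, the other two flags should not be trusted, since differences between settings can then no longer be separated from the run-to-run scatter.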
