Segfault when above ~700 atoms using ASE #218

Open
tgmaxson opened this issue Oct 14, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@tgmaxson

Describe the bug
Systems that are too large segfault when using the ASE calculator.

To Reproduce
Steps to reproduce the behaviour:

from ase.build import bulk
from dftd4.ase import DFTD4

calc = DFTD4(method="PBE")

# 9x9x9 supercell of the one-atom fcc Ag primitive cell -> 729 atoms
size = 9
atoms = bulk("Ag") * (size, size, size)
print(len(atoms))  # 729

atoms.calc = calc
atoms.get_potential_energy()  # segfaults here

This crashes immediately with a segfault on multiple clusters. The cutoff is somewhere around 680 atoms, we think, but I do not remember the exact point where it starts failing. It does not seem to depend on the method, the atomic species, or the cell volume.

Interestingly, D3 works fine using the simple D3 calculator, and VASP also calculates this system without problems.

@marvinfriede
Member

It works if you set ulimit -s unlimited in your environment, as also suggested in the xtb docs.
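For scripted environments, the same workaround can be attempted from within Python via the standard resource module; a minimal sketch, assuming that raising the soft limit at runtime takes effect on your platform (the shell-level ulimit -s unlimited before launching Python remains the documented route):

import resource

# Raise the stack soft limit to the hard limit (the hard limit may be
# RLIM_INFINITY); whether this helps an already-running main thread is
# platform-dependent
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))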

@tgmaxson
Author

Isn't requiring a larger stack like this typically considered a bug / bad practice? Why is DFTD4 going down such an extreme recursive path that scales with atom count?

I will check this now, however.

@marvinfriede
Member

marvinfriede commented Oct 15, 2023

Additional info: I tested a large molecule (1000 atoms, coord.txt), and it

  • works with versions 3.5.0 and 3.6.0 of the standalone program
    (compiled with ifort and meson setup ... --buildtype=release --default-library=static -Dfortran_link_args="-static" -Dfortran_args="-Ofast -axAVX2 -mtune=core-avx2 -fma")
  • segfaults with older versions of the standalone program (tested 3.3.0 and 3.4.0)
  • segfaults with ase
  • segfaults with the dftd4 version (3.5.0) from conda

If more than one core is used, one also (expectedly) has to increase OMP_STACKSIZE, since each OpenMP worker thread gets its own stack.
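A minimal sketch of setting this from Python before the first dftd4 call (the 4G value is an arbitrary assumption, and the variable must be set before the OpenMP runtime initializes):

import os

# Per-thread stack size for OpenMP worker threads; must be exported before
# the first parallel region runs (the value chosen here is only an example)
os.environ["OMP_STACKSIZE"] = "4G"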

Inspecting the error with version 3.3.0 suggests that the problem comes from the multicharge library.

Error log
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
dftd4              00000000020E1F6A  Unknown               Unknown  Unknown
dftd4              00000000022DF200  Unknown               Unknown  Unknown
dftd4              0000000000453F41  multicharge_model         131  model.f90
dftd4              00000000021FE773  Unknown               Unknown  Unknown
dftd4              00000000021C0DCC  Unknown               Unknown  Unknown
dftd4              00000000021908B8  Unknown               Unknown  Unknown
dftd4              0000000000453A49  multicharge_model         131  model.f90
dftd4              000000000044FAD2  multicharge_model         462  model.f90
dftd4              000000000043A35D  dftd4_charge_mp_g          67  charge.f90
dftd4              000000000040F08D  dftd4_disp_mp_get          82  disp.f90
dftd4              00000000004063A2  MAIN__                    150  main.f90
dftd4              00000000004053D2  Unknown               Unknown  Unknown
dftd4              00000000022E06A0  Unknown               Unknown  Unknown
dftd4              00000000004052B7  Unknown               Unknown  Unknown

@tgmaxson
Author

I tested up to 6,000,000 atoms in ASE with the stack size changed, and it worked fine. It would still be good if we didn't need to increase the stack, but this at least works for us.

Maybe the code could throw a warning when the stack size is limited and more than 500 or so atoms are used? What I find weird is that it worked in VASP, I believe, so this is potentially an actual bug in the ASE interface / the mamba build.
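As a sketch of what such a check could look like on POSIX systems (the 500-atom threshold and the helper name are hypothetical, taken only from the suggestion above):

import resource
import warnings

def warn_if_stack_limited(n_atoms, threshold=500):
    # Hypothetical helper: warn when the soft stack limit is finite and the
    # system is large enough that stack temporaries may overflow
    soft, _ = resource.getrlimit(resource.RLIMIT_STACK)
    if soft != resource.RLIM_INFINITY and n_atoms > threshold:
        warnings.warn(
            f"{n_atoms} atoms with a stack limit of {soft / 1024**2:.0f} MB; "
            "consider 'ulimit -s unlimited' (and OMP_STACKSIZE for OpenMP runs)"
        )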

@marvinfriede
Member

I found more context and helpful explanations (here, here, and especially here).

To summarize the most important points:

  • it is compiler-dependent whether certain arrays go on the stack or the heap (there are flags to modify this behavior)
  • the default stack size is 8 MB, and even setting it to unlimited only increases it to 64 MB (on my machine, ulimit -a | grep "\-s")
  • coming back to the error trace in my previous comment: the error seems to come from the multicharge library, in particular from the get_amat_0d subroutine, so it is not surprising that D3 works without problems (a rough size estimate follows this list)
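A back-of-envelope check of the stack-size point above (assuming, which is not verified here, that the EEQ system matrix built in get_amat_0d, roughly (nat+1) x (nat+1) doubles, ends up as a stack temporary):

# One (nat+1) x (nat+1) double-precision matrix near the reported threshold
nat = 700
size_mb = (nat + 1) ** 2 * 8 / 1024**2
print(f"{size_mb:.1f} MB")  # ~3.7 MB; two such temporaries approach the 8 MB default

That would be roughly consistent with the ~680-700 atom threshold reported above.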

Since it comes from a dependency, I do not think we should change anything in DFT-D4. I am not sure whether increasing the stack size is problematic; it also seems to be accepted by the Fortran community.
