Segfault in ELPA #3297

Open
oschuett opened this issue Mar 4, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@oschuett
Member

oschuett commented Mar 4, 2024

Since #3184, QS/regtest-almo-2/ion-pair.inp has been segfaulting. Valgrind reports:

==130454== Invalid write of size 8
==130454==    at 0xA8E793B: memmove (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==130454==    by 0x4CBCE6F: __elpa2_compute_MOD_tridiag_band_real_double (elpa2_tridiag_band_template.F90:1165)
==130454==    by 0x4C83332: __elpa2_impl_MOD_elpa_solve_evp_real_2stage_a_h_a_double_impl (elpa2_template.F90:1155)
==130454==    by 0x4C00A8A: __elpa_impl_MOD_elpa_eigenvectors_a_h_a_d (elpa_impl_math_solvers_template.F90:126)
==130454==    by 0x34534DC: __cp_fm_elpa_MOD_cp_fm_diag_elpa (cp_fm_elpa.F:536)
==130454==    by 0x344E910: __cp_fm_diag_MOD_choose_eigv_solver (cp_fm_diag.F:228)
==130454==    by 0x20AF98E: __preconditioner_makes_MOD_make_full_all.constprop.0 (preconditioner_makes.F:499)
==130454==    by 0x20B27B4: __preconditioner_makes_MOD_make_preconditioner_matrix (preconditioner_makes.F:126)
==130454==    by 0x1EBB40F: __preconditioner_MOD_make_preconditioner (preconditioner.F:202)
==130454==    by 0x1EBC37B: __preconditioner_MOD_prepare_preconditioner (preconditioner.F:430)
==130454==    by 0x1B3584E: __qs_scf_MOD_scf_env_do_scf (qs_scf.F:839)
==130454==    by 0x1B41E53: __qs_scf_MOD_scf (qs_scf.F:246)

I found that changing the OT preconditioner from FULL_ALL to FULL_KINETIC mitigates the problem.
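For readers who want to try the same mitigation, the change would look roughly like the fragment below. This is a minimal sketch, not the actual diff to ion-pair.inp: it assumes the test uses the standard OT section under FORCE_EVAL/DFT/SCF, and everything else in that input is omitted.

```
&FORCE_EVAL
  &DFT
    &SCF
      &OT
        ! Assumption: the test previously set FULL_ALL here.
        ! FULL_KINETIC avoids the diagonalization that ends up in ELPA.
        PRECONDITIONER FULL_KINETIC
      &END OT
    &END SCF
  &END DFT
&END FORCE_EVAL
```

This only sidesteps the crash; the ELPA code path itself is left untouched.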

Does anybody have an idea what might be going on?

@oschuett oschuett added the bug label Mar 4, 2024
@fstein93
Contributor

fstein93 commented Mar 4, 2024

What happens if you run the test with a single thread? The pdbg-toolchain (single thread) passes with ELPA but the psmp-testers (two threads) do not.
Regarding the preconditioner: FULL_ALL, FULL_SINGLE and FULL_SINGLE_INVERSE require a diagonalization, whereas FULL_KINETIC and FULL_S_INVERSE do not.

@oschuett
Member Author

oschuett commented Mar 4, 2024

The psmp binary also crashes with a single thread and a single rank. The binaries that don't segfault produce wrong results instead. Note that I had to lower the threshold in my original PR.

Either this bug is quite old or ALMO somehow assumes a block size of 32.

@oschuett oschuett changed the title from "Segfault in ALMO" to "Segfault in ELPA" Mar 9, 2024
@oschuett
Member Author

oschuett commented Mar 9, 2024

It seems that this is not so much an issue with ALMO, but rather with ELPA (or our integration of it).

As a workaround, I've now switched regtest-almo-2/ion-pair.inp to ScaLAPACK.
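A switch like this is usually made in the GLOBAL section of the input. The fragment below is a hedged sketch of what it might look like, not a copy of the change made to the regtest. It assumes the PREFERRED_DIAG_LIBRARY keyword is used, and the accepted spelling of the value (ScaLAPACK, or the short alias SL) should be checked against the CP2K input reference for the version in use.

```
&GLOBAL
  ! Assumption: select the diagonalization library per input file.
  ! ELPA is otherwise used when CP2K is built with it.
  PREFERRED_DIAG_LIBRARY ScaLAPACK
&END GLOBAL
```

With this setting, the FULL_ALL preconditioner still performs its diagonalization, but through ScaLAPACK rather than the ELPA two-stage solver seen in the trace.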
