Aborted Jobs and Core Dumping #232

Open
scheah86 opened this issue Oct 26, 2023 · 3 comments

Comments

@scheah86

scheah86 commented Oct 26, 2023

Issue summary

Hello,

I hope this message finds you well. I would like to report an issue I've encountered and share the details I have gathered so far.

While processing approximately 3.7 million compounds on the supercomputer, I ran into core dumps. I submitted a total of 1,488 jobs and received the same count in return, but several jobs appear to have been terminated prematurely. This left some files only partially processed, as shown in the attached images.

image
image

After rerunning the script with the same compounds in a separate directory, I found that the termination patterns appear to be random. Upon consulting with the technical team managing our supercomputer, I received feedback about a possible "numerical_error". They identified an error suggesting a potential issue with the code's compatibility with my inputs, resulting in the aforementioned core dump.

Given the feedback:
*******************terminate called after throwing an instance of 'numerical_error'
what(): Numerical degeneracy encountered. Check for non-physical inputs.

As I understand it, "Numerical degeneracy" may indicate, among other possible causes, that a matrix could not be inverted because it is degenerate. The technical team recommended trying an updated version of the code or adjusting the initial conditions.
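
To make the "degenerate matrix" idea concrete, here is a minimal Python sketch. It is an illustration only and is not taken from gnina (which is written in C++); the function name `invert_2x2` and the tolerance `rel_tol` are hypothetical. It inverts a 2x2 matrix and refuses when the determinant is negligible compared with the size of the entries, which is one common way such an error can arise.

```python
def invert_2x2(m, rel_tol=1e-12):
    """Invert a 2x2 matrix given as [[a, b], [c, d]].

    Raises ValueError when the determinant is negligible relative to the
    size of the entries, i.e. the matrix is numerically degenerate.
    rel_tol is an illustrative threshold, not a value used by gnina.
    """
    (a, b), (c, d) = m
    det = a * d - b * c
    # Scale the threshold by the largest entry so the test is unit-independent.
    scale = max(abs(a), abs(b), abs(c), abs(d)) ** 2
    if scale == 0.0 or abs(det) <= rel_tol * scale:
        raise ValueError("numerical degeneracy: matrix is not invertible")
    return [[d / det, -b / det], [-c / det, a / det]]


# A well-conditioned matrix inverts normally.
print(invert_2x2([[2.0, 0.0], [0.0, 4.0]]))

# The second row is twice the first, so the determinant is exactly zero.
try:
    invert_2x2([[1.0, 2.0], [2.0, 4.0]])
except ValueError as err:
    print(err)
```

In a real docking code the matrix would come from the geometry of the input, so an input with, for example, linearly dependent vectors could trigger this kind of failure. Whether that is what happens here is not established by the message alone.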

I am reaching out to seek your expertise on how to best approach and rectify this situation. Your guidance would be invaluable.

Thank you in advance for your time and assistance. Attached is one of the incompletely processed files; I hope it helps.
AD4_docking_rec_4ZZI_clean_min_EN_HTS_103.txt
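
To narrow down which of the 1,488 jobs hit the abort quoted above, one option is to scan the per-job logs for that message. The sketch below assumes each job writes its stderr to a file matching `*.log` in a single directory; that naming scheme and the function name `find_aborted_jobs` are hypothetical and would need adapting to the actual scheduler setup.

```python
import tempfile
from pathlib import Path

# Text taken from the abort message quoted in this issue.
ABORT_MARKER = "terminate called after throwing an instance of 'numerical_error'"


def find_aborted_jobs(log_dir, pattern="*.log", marker=ABORT_MARKER):
    """Return the sorted names of log files in log_dir that contain marker.

    The "*.log" pattern is a hypothetical naming scheme; adjust it to
    wherever the scheduler writes each job's stderr.
    """
    aborted = []
    for path in sorted(Path(log_dir).glob(pattern)):
        # errors="replace" keeps the scan going if a log has stray bytes.
        text = path.read_text(errors="replace")
        if marker in text:
            aborted.append(path.name)
    return aborted


if __name__ == "__main__":
    # Small self-contained demo with two fake logs.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "job_001.log").write_text("docking finished\n")
        Path(tmp, "job_002.log").write_text(ABORT_MARKER + "\n")
        print(find_aborted_jobs(tmp))
```

Running this on the log directories of both runs and comparing the two lists would show whether the same input files fail each time or whether the failures really are random.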

Steps to reproduce

  1. Log into a supercomputer account
  2. cd to the directory containing 1488 files in SDF format
  3. Run the GNINA script

Your system configuration

Operating system: Swinburne HPC System Ngarrgu Tindebeek
Compiler:
CUDA version (if applicable):
CUDNN version (if applicable):
Python version:

@dkoes
Contributor

dkoes commented Oct 26, 2023

If you can build a more recent version of gnina, it will print out a slightly more informative error message. This would help indicate if the problem is related to floating point rounding or nan generation.
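
As a rough illustration of the two failure modes mentioned here, the following Python sketch contrasts a rounding residue with a generated NaN. It is a generic floating-point example, not gnina code; the helper `classify` and its tolerance are hypothetical.

```python
import math


def classify(value, tol=1e-9):
    """Label a float as 'nan', 'inf', 'rounding-sized' or 'ordinary'.

    tol is an illustrative cutoff for "small enough to be rounding noise".
    """
    if math.isnan(value):
        return "nan"
    if math.isinf(value):
        return "inf"
    if value != 0.0 and abs(value) < tol:
        return "rounding-sized"
    return "ordinary"


# Rounding: the exact answer is 0, but binary floats leave a tiny residue.
residue = (0.1 + 0.2) - 0.3

# NaN generation: inf - inf has no meaningful value, and the NaN then
# propagates through any later arithmetic.
not_a_number = float("inf") - float("inf")

print(classify(residue))
print(classify(not_a_number + 1.0))
```

A rounding problem tends to depend on the exact order of operations, while a NaN usually points back to a specific invalid operation on the input, so knowing which one occurred helps decide where to look.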

@AquifersBSIM

Hello dkoes, thank you so much for your input! If I may ask, what would be the more recent version of gnina?

@dkoes
Contributor

dkoes commented Oct 27, 2023

The commit that added the more informative message was b879b81, but getting the latest from GitHub would work as well.
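
For anyone unsure whether an existing source checkout already includes that commit, a small Python wrapper around `git merge-base --is-ancestor` can answer it. This is a sketch under assumptions: the helper names are hypothetical, `git` must be on the PATH, and the checkout path is a placeholder.

```python
import subprocess


def is_ancestor_cmd(commit, ref="HEAD"):
    """Build the git command that tests whether commit is an ancestor of ref."""
    return ["git", "merge-base", "--is-ancestor", commit, ref]


def checkout_contains(repo_dir, commit, ref="HEAD"):
    """Return True if commit is an ancestor of ref in the repo at repo_dir.

    git exits with 0 when the commit is an ancestor, 1 when it is not,
    and another status on errors such as an unknown commit.
    """
    result = subprocess.run(
        is_ancestor_cmd(commit, ref), cwd=repo_dir, capture_output=True
    )
    if result.returncode == 0:
        return True
    if result.returncode == 1:
        return False
    raise RuntimeError(result.stderr.decode(errors="replace").strip())
```

For example, `checkout_contains("path/to/gnina", "b879b81")` (with the placeholder path replaced by the real checkout location) would indicate whether a rebuild from that checkout should print the more informative message.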
