Crash with relion and GPU #1084

Open
relion67 opened this issue Feb 16, 2024 · 2 comments

Comments

@relion67

Hello,
I'm writing to you about a problem we're having with RELION.
We're trying to run jobs with RELION version 4.0-beta-1-commit-1fb5b8 on a CentOS 7.6 system.
We're using a machine with 4 GPUs, and the program crashes very regularly when we tell it to use all 4.
If we use only 2 GPUs, the program takes an extremely long time to run.
Recently, we hit another problem of this type.
It happened with the following settings for a 3D refinement in RELION 4.0:

GPU 0,1
MPI 3
THREADS 6

Error message:

000/??? sec ~~(,_,"> [oo]ERROR: CudaCustomAllocator out of memory
[requestedSpace: 340660736 B]
[largestContinuousFreeSpace: 80307200 B]
[totalFreeSpace: 80307200 B]
(113152B) (114688B) (113152B) (114688B) (113152B) (114688B) (97472000B) (194943488B) (194943488B) (194943488B) (194943488B) (389886464B) (44544B) (2048B) (22528B) (5927424B) (173181440B) (346362880B) (346362880B) (346362880B) (346362880B) (692725248B) (44544B) (2048B) (22528B) (10531328B) (170330624B) (340660736B) (340660736B) (340660736B) [80307200B] = 4808391168B
ERROR: CudaCustomAllocator out of memory
[requestedSpace: 348200448 B]
[largestContinuousFreeSpace: 129793536 B]

(113152B) (114688B) (113152B) (114688B) (113152B) (114688B) (174233088B) (348465664B) (348465664B) (348465664B) (348465664B) (696930816B) (44544B) (2048B) (22528B) (10595328B) (170590720B) (341181440B) (341181440B) (341181440B) (341181440B) (682362880B) (44544B) (2048B) (22528B) (10374144B) (174100480B) [129793536B] = 4808391168B

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 44322 RUNNING AT serveur-linuxvixion
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
RELION version: 4.0-beta-1-commit-1fb5b8
Precision: BASE=double

We'd like to upgrade to a newer version of RELION. Do you know of any constraints imposed by the CentOS version currently installed on our machine?
Thank you in advance for your help.
Have a nice day!
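
For reference, the arithmetic in the first allocator report can be checked with a few lines of Python. This is only a minimal sketch and not part of RELION: the function name `parse_report` is hypothetical, and reading the parenthesised sizes as allocated blocks and the bracketed size as free space is an assumption, although it agrees with the reported `totalFreeSpace`.

```python
import re

# First allocator report, copied verbatim from the error message above.
REPORT = """\
[requestedSpace: 340660736 B]
[largestContinuousFreeSpace: 80307200 B]
[totalFreeSpace: 80307200 B]
(113152B) (114688B) (113152B) (114688B) (113152B) (114688B) (97472000B) (194943488B) (194943488B) (194943488B) (194943488B) (389886464B) (44544B) (2048B) (22528B) (5927424B) (173181440B) (346362880B) (346362880B) (346362880B) (346362880B) (692725248B) (44544B) (2048B) (22528B) (10531328B) (170330624B) (340660736B) (340660736B) (340660736B) [80307200B] = 4808391168B
"""


def parse_report(text):
    """Split one report into header fields, block sizes, and the pool total."""
    # Header lines look like "[requestedSpace: 340660736 B]".
    fields = {k: int(v) for k, v in re.findall(r"\[(\w+):\s*(\d+) B\]", text)}
    # Assumption: "(N B)" entries are allocated blocks, "[N B]" entries are free.
    used = [int(v) for v in re.findall(r"\((\d+)B\)", text)]
    free = [int(v) for v in re.findall(r"\[(\d+)B\]", text)]
    # The block list ends with "= <pool size>B".
    total = int(re.search(r"=\s*(\d+)B", text).group(1))
    return fields, used, free, total


fields, used, free, total = parse_report(REPORT)
shortfall = fields["requestedSpace"] - fields["largestContinuousFreeSpace"]

MIB = 2**20
print(f"requested:          {fields['requestedSpace'] / MIB:8.1f} MiB")
print(f"largest free block: {fields['largestContinuousFreeSpace'] / MIB:8.1f} MiB")
print(f"shortfall:          {shortfall / MIB:8.1f} MiB")
print(f"pool size:          {total / 2**30:8.2f} GiB")
print(f"blocks add up to pool size: {sum(used) + sum(free) == total}")
```

With these numbers the request is about 325 MiB, while the largest free block is about 77 MiB out of a pool of roughly 4.5 GiB, so the pool on that GPU was almost entirely in use when the request failed.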

@biochem-fan
Member

First of all, please respect our issue template.
Without details of your dataset and hardware, we cannot provide a good answer.

This is a very common question.
Please search "CudaCustomAllocator" in the CCPEM mailing list https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=CCPEM.

> We'd like to upgrade to a higher version of relion

You should definitely do so. Why are you still using the beta version of 4.0?

@relion67
Author

Hi,
Sorry for the way I presented the issue.
We'll check what you suggested and let you know afterwards.
Best regards
