Strange memory access problem while running relion_refine with VDAM algorithm #1119

Open
Gia1975 opened this issue May 3, 2024 · 4 comments

Comments


Gia1975 commented May 3, 2024

Dear Relion Developers,

I am trying to do some 2D classification on a newly collected dataset and I am getting a strange error while running 2DClass.

Actually, I get this error only with the VDAM algorithm, not with the EM one.
I have tried changing the --pool value (1, 10 and 100). I suspect the problem is related to disk and/or memory access, but I am not expert enough to troubleshoot it.

I am processing a preliminary dataset of ~26,000 particles, 64x64 pixels.
My Linux workstation has 128 GB of RAM and two RTX A5000 cards.

I get the same error whether I read the particles directly from the hard drive or copy them to the SSD (scratch).

I have never seen this message before when processing other datasets with the same parameters.

With many thanks,

GIA

COMMAND:

`which relion_refine` --o Class2D/job027/run --grad --class_inactivity_threshold 0.1 --grad_write_iter 10 --iter 200 --i Extract/job012/particles.star --dont_combine_weights_via_disc --scratch_dir /scratch --pool 100 --pad 2 --ctf --tau2_fudge 2 --particle_diameter 150 --K 50 --flatten_solvent --zero_mask --strict_highres_exp 7 --center_classes --oversampling 1 --psi_step 12 --offset_range 5 --offset_step 2 --norm --scale --j 4 --gpu "" --pipeline_control Class2D/job027/

ERROR: an illegal memory access was encountered in /home/jenkins/workspace/CCP-EM/sl6_devtoolset/devtools/checkout/relion-ver4.0/src/acc/cuda/custom_allocator.cuh at line 175 (error-code 77)
in: /home/jenkins/workspace/CCP-EM/sl6_devtoolset/devtools/checkout/relion-ver4.0/src/acc/cuda/cuda_settings.h, line 65
ERROR:

A GPU-function failed to execute.

If this occured at the start of a run, you might have GPUs which
are incompatible with either the data or your installation of relion.
If you

      -> INSTALLED RELION YOURSELF: if you e.g. specified -DCUDA_ARCH=50
       and are trying ot run on a compute 3.5 GPU (-DCUDA_ARCH=3.5),
       this may happen.

      -> HAVE MULTIPLE GPUS OF DIFFERNT VERSIONS: relion needs GPUS with
       at least compute 3.5. You may be trying to use a GPU older than
       this. If you have multiple generations, try specifying --gpu
       with X=0. Then try X=1 in a new run, and so on. The numbering of
       GPUs may not be obvious from the driver or intuition. For a list
       of GPU compute generations, see

       en.wikipedia.org/wiki/CUDA#Version_features_and_specifications

      -> ARE USING DOUBLE-PRECISION GPU CODE: relion was been written so
       as to not require this, and may thus have unforeseen requirements
       when run in this mode. If you think it is nonetheless necessary,
       please consult the developers with this error.

If this occurred at the middle or end of a run, it might be that

      -> YOUR DATA OR PARAMETERS WERE UNEXPECTED: execution on GPUs is
       subject to many restrictions, and relion is written to work within
       common restraints. If you have exotic data or settings, unexpected
       configurations may occur. See also above point regarding
       double precision.
If none of the above applies, please report the error to the relion
developers at github.com/3dem/relion/issues
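
The message above suggests running with `--gpu` set to 0, then to 1, to find out whether one card is responsible. A minimal sketch of how the two test commands could be generated is shown here. It only prints the commands. The output prefix `Class2D/gpu_test` is a hypothetical name, and dropping `--pipeline_control` for a manual test run is an assumption, not something stated in this thread.

```python
import shlex

# Arguments copied from the COMMAND section of this report, without
# --o, --gpu and --pipeline_control (those are set or dropped per test run).
BASE_ARGS = [
    "relion_refine", "--grad", "--class_inactivity_threshold", "0.1",
    "--grad_write_iter", "10", "--iter", "200",
    "--i", "Extract/job012/particles.star",
    "--dont_combine_weights_via_disc", "--scratch_dir", "/scratch",
    "--pool", "100", "--pad", "2", "--ctf", "--tau2_fudge", "2",
    "--particle_diameter", "150", "--K", "50", "--flatten_solvent",
    "--zero_mask", "--strict_highres_exp", "7", "--center_classes",
    "--oversampling", "1", "--psi_step", "12", "--offset_range", "5",
    "--offset_step", "2", "--norm", "--scale", "--j", "4",
]


def per_gpu_commands(base_args, gpu_ids, out_prefix="Class2D/gpu_test"):
    """Return one shell command string per GPU id.

    out_prefix is a hypothetical output location; each run gets its own
    directory so the results of the two tests do not overwrite each other.
    """
    commands = []
    for gpu in gpu_ids:
        args = list(base_args) + [
            "--o", f"{out_prefix}_{gpu}/run",
            "--gpu", str(gpu),
        ]
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    for cmd in per_gpu_commands(BASE_ARGS, [0, 1]):
        print(cmd)
```

If the error appears with one device id and not the other, that points towards a single card. If it appears with both, the hardware explanation becomes less likely. The output directories would have to exist before the printed commands are run.
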

gpu-ids not specified, threads will automatically be mapped to devices (incrementally).
Thread 0 mapped to device 0
Thread 1 mapped to device 0
Thread 2 mapped to device 1
Thread 3 mapped to device 1
Running CPU instructions in double precision.

  • WARNING: Changing psi sampling rate (before oversampling) to 11.25 degrees, for more efficient GPU calculations
    Initial subset size set to 200
    Final subset size set to 1329
  • On host gbamod26: free scratch space = 896.592 Gb.
    Copying particles to scratch directory: /scratch/relion_volatile/
    1/ 1 sec ............................................................~~(,_,">
    For optics_group 1, there are 26597 particles on the scratch disk.
    Estimating initial noise spectra from 1000 particles
    0/ 0 sec ............................................................~~(,_,">
    Estimating accuracies in the orientational assignment ...
    0/ 0 sec ............................................................~~(,_,">
    Auto-refine: Estimated accuracy angles= 29.1 degrees; offsets= 18.432 Angstroms
    CurrentResolution= 61.44 Angstroms, which requires orientationSampling of at least 45 degrees for a particle of diameter 150 Angstroms
    Oversampling= 0 NrHiddenVariableSamplingPoints= 33600
    OrientationalSampling= 11.25 NrOrientations= 32
    TranslationalSampling= 7.68 NrTranslations= 21
    =============================
    Oversampling= 1 NrHiddenVariableSamplingPoints= 1075200
    OrientationalSampling= 5.625 NrOrientations= 256
    TranslationalSampling= 3.84 NrTranslations= 84
    =============================
    Gradient optimisation iteration 1 of 200 with 200 particles (Step size 0.9)
    2/ 2 sec ............................................................~~(,_,">
    Maximization ...
    0/ 0 sec ............................................................~~(,_,">
    CurrentResolution= 49.152 Angstroms, which requires orientationSampling of at least 36 degrees for a particle of diameter 150 Angstroms
    Oversampling= 0 NrHiddenVariableSamplingPoints= 33600
    OrientationalSampling= 11.25 NrOrientations= 32
    TranslationalSampling= 7.68 NrTranslations= 21
    =============================
    Oversampling= 1 NrHiddenVariableSamplingPoints= 1075200
    OrientationalSampling= 5.625 NrOrientations= 256
    TranslationalSampling= 3.84 NrTranslations= 84
    =============================
    Gradient optimisation iteration 2 of 200 with 200 particles (Step size 0.9)
    1/ 1 sec ............................................................~~(,_,">
    Maximization ...
    0/ 0 sec ............................................................~~(,_,">
    CurrentResolution= 49.152 Angstroms, which requires orientationSampling of at least 36 degrees for a particle of diameter 150 Angstroms
    Oversampling= 0 NrHiddenVariableSamplingPoints= 33600
    OrientationalSampling= 11.25 NrOrientations= 32
    TranslationalSampling= 7.68 NrTranslations= 21
    =============================
    Oversampling= 1 NrHiddenVariableSamplingPoints= 1075200
    OrientationalSampling= 5.625 NrOrientations= 256
    TranslationalSampling= 3.84 NrTranslations= 84
    =============================
    Gradient optimisation iteration 3 of 200 with 200 particles (Step size 0.9)
    0/ 0 sec ............................................................~~(,_,">
    Maximization ...
    0/ 0 sec ............................................................~~(,_,">
    CurrentResolution= 49.152 Angstroms, which requires orientationSampling of at least 36 degrees for a particle of diameter 150 Angstroms
    Oversampling= 0 NrHiddenVariableSamplingPoints= 33600
    OrientationalSampling= 11.25 NrOrientations= 32
    TranslationalSampling= 7.68 NrTranslations= 21
    =============================
    Oversampling= 1 NrHiddenVariableSamplingPoints= 1075200
    OrientationalSampling= 5.625 NrOrientations= 256
    TranslationalSampling= 3.84 NrTranslations= 84
    =============================
    Gradient optimisation iteration 4 of 200 with 200 particles (Step size 0.9)
    000/??? sec ~~(,_,"> oo (1536B) (512B) (1536B) (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [1024B] (512B) (1536B) (512B) (1536B) (512B) (1536B) [512B] (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (16384B) (16896B) (16384B) <16384B> [24729147392B] = 24729320448B
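
The sampling numbers in the log above are internally consistent, which can be checked with a few lines of arithmetic. In this log, NrHiddenVariableSamplingPoints equals the number of classes (`--K 50`) times NrOrientations times NrTranslations, and 360 degrees divided by the adjusted psi step of 11.25 degrees gives the 32 coarse orientations. This is an observation from the logged values only, not a description of RELION internals. The function name is hypothetical.

```python
def hidden_variable_points(n_classes, n_orientations, n_translations):
    """Product of classes, orientations and translations, as seen in the log."""
    return n_classes * n_orientations * n_translations


K = 50  # from --K 50 in the command

# Oversampling= 0 block of the log: 32 orientations, 21 translations
coarse = hidden_variable_points(K, 32, 21)

# Oversampling= 1 block of the log: 256 orientations, 84 translations
fine = hidden_variable_points(K, 256, 84)

print(coarse)       # 33600, matches the log
print(fine)         # 1075200, matches the log
print(360 / 11.25)  # 32.0, the coarse NrOrientations
```
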
@biochem-fan
Member

Does this happen with the latest version of RELION (5.0 beta 3)?

@Gia1975
Author

Gia1975 commented May 6, 2024

Hi,

No, this is RELION 4.0.

Thanks,

GIA

@biochem-fan
Member

Did you test RELION 5.0?

We would like to focus bug fixes on RELION 5.0, because 5.0 is getting closer to stable release.

@Gia1975
Author

Gia1975 commented May 16, 2024 via email
