Strange memory access problem while running relion_refine with the VDAM algorithm #1119

Gia1975 · 2024-05-03T14:34:22Z

Dear Relion Developers,

I am trying to run 2D classification on a newly collected dataset and I am getting a strange error in the Class2D job.

I get this error only with the VDAM algorithm, not with the EM one.
I have tried varying the --pool value (1, 10 and 100), and I suspect the problem has something to do with disk and/or memory access, but I am not expert enough to troubleshoot it.

I am processing a preliminary dataset of ~26000 particles, 64x64 pixels.
My Linux workstation has 128 GB of RAM and two RTX A5000 cards.

I get the same error whether I read the particles directly from the hard drive or copy them to the SSD (scratch).

I have never seen this message before when processing previous datasets with the same parameters.

With many thanks,

GIA

COMMAND:

`which relion_refine` --o Class2D/job027/run --grad --class_inactivity_threshold 0.1 --grad_write_iter 10 --iter 200 --i Extract/job012/particles.star --dont_combine_weights_via_disc --scratch_dir /scratch --pool 100 --pad 2 --ctf --tau2_fudge 2 --particle_diameter 150 --K 50 --flatten_solvent --zero_mask --strict_highres_exp 7 --center_classes --oversampling 1 --psi_step 12 --offset_range 5 --offset_step 2 --norm --scale --j 4 --gpu "" --pipeline_control Class2D/job027/

ERROR: an illegal memory access was encountered in /home/jenkins/workspace/CCP-EM/sl6_devtoolset/devtools/checkout/relion-ver4.0/src/acc/cuda/custom_allocator.cuh at line 175 (error-code 77)
in: /home/jenkins/workspace/CCP-EM/sl6_devtoolset/devtools/checkout/relion-ver4.0/src/acc/cuda/cuda_settings.h, line 65
ERROR:

A GPU-function failed to execute.

If this occured at the start of a run, you might have GPUs which
are incompatible with either the data or your installation of relion.
If you

-> INSTALLED RELION YOURSELF: if you e.g. specified -DCUDA_ARCH=50
and are trying ot run on a compute 3.5 GPU (-DCUDA_ARCH=3.5),
this may happen.

-> HAVE MULTIPLE GPUS OF DIFFERNT VERSIONS: relion needs GPUS with
at least compute 3.5. You may be trying to use a GPU older than
this. If you have multiple generations, try specifying --gpu
with X=0. Then try X=1 in a new run, and so on. The numbering of
GPUs may not be obvious from the driver or intuition. For a list
of GPU compute generations, see

en.wikipedia.org/wiki/CUDA#Version_features_and_specifications

-> ARE USING DOUBLE-PRECISION GPU CODE: relion was been written so
as to not require this, and may thus have unforeseen requirements
when run in this mode. If you think it is nonetheless necessary,
please consult the developers with this error.

If this occurred at the middle or end of a run, it might be that

-> YOUR DATA OR PARAMETERS WERE UNEXPECTED: execution on GPUs is
subject to many restrictions, and relion is written to work within
common restraints. If you have exotic data or settings, unexpected
configurations may occur. See also above point regarding
double precision.
If none of the above applies, please report the error to the relion
developers at github.com/3dem/relion/issues

gpu-ids not specified, threads will automatically be mapped to devices (incrementally).
Thread 0 mapped to device 0
Thread 1 mapped to device 0
Thread 2 mapped to device 1
Thread 3 mapped to device 1
Running CPU instructions in double precision.

WARNING: Changing psi sampling rate (before oversampling) to 11.25 degrees, for more efficient GPU calculations
Initial subset size set to 200
Final subset size set to 1329
On host gbamod26: free scratch space = 896.592 Gb.
Copying particles to scratch directory: /scratch/relion_volatile/
1/ 1 sec ............................................................(,_,">
For optics_group 1, there are 26597 particles on the scratch disk.
Estimating initial noise spectra from 1000 particles
0/ 0 sec ............................................................(,,">
Estimating accuracies in the orientational assignment ...
0/ 0 sec ............................................................~~(,,">
Auto-refine: Estimated accuracy angles= 29.1 degrees; offsets= 18.432 Angstroms
CurrentResolution= 61.44 Angstroms, which requires orientationSampling of at least 45 degrees for a particle of diameter 150 Angstroms
Oversampling= 0 NrHiddenVariableSamplingPoints= 33600
OrientationalSampling= 11.25 NrOrientations= 32
TranslationalSampling= 7.68 NrTranslations= 21
=============================
Oversampling= 1 NrHiddenVariableSamplingPoints= 1075200
OrientationalSampling= 5.625 NrOrientations= 256
TranslationalSampling= 3.84 NrTranslations= 84
=============================
Gradient optimisation iteration 1 of 200 with 200 particles (Step size 0.9)
2/ 2 sec ............................................................(,_,">
Maximization ...
0/ 0 sec ............................................................(,,">
CurrentResolution= 49.152 Angstroms, which requires orientationSampling of at least 36 degrees for a particle of diameter 150 Angstroms
Oversampling= 0 NrHiddenVariableSamplingPoints= 33600
OrientationalSampling= 11.25 NrOrientations= 32
TranslationalSampling= 7.68 NrTranslations= 21
=============================
Oversampling= 1 NrHiddenVariableSamplingPoints= 1075200
OrientationalSampling= 5.625 NrOrientations= 256
TranslationalSampling= 3.84 NrTranslations= 84
=============================
Gradient optimisation iteration 2 of 200 with 200 particles (Step size 0.9)
1/ 1 sec ............................................................~~(,,">
Maximization ...
0/ 0 sec ............................................................(,_,">
CurrentResolution= 49.152 Angstroms, which requires orientationSampling of at least 36 degrees for a particle of diameter 150 Angstroms
Oversampling= 0 NrHiddenVariableSamplingPoints= 33600
OrientationalSampling= 11.25 NrOrientations= 32
TranslationalSampling= 7.68 NrTranslations= 21
=============================
Oversampling= 1 NrHiddenVariableSamplingPoints= 1075200
OrientationalSampling= 5.625 NrOrientations= 256
TranslationalSampling= 3.84 NrTranslations= 84
=============================
Gradient optimisation iteration 3 of 200 with 200 particles (Step size 0.9)
0/ 0 sec ............................................................(,,">
Maximization ...
0/ 0 sec ............................................................~~(,,">
CurrentResolution= 49.152 Angstroms, which requires orientationSampling of at least 36 degrees for a particle of diameter 150 Angstroms
Oversampling= 0 NrHiddenVariableSamplingPoints= 33600
OrientationalSampling= 11.25 NrOrientations= 32
TranslationalSampling= 7.68 NrTranslations= 21
=============================
Oversampling= 1 NrHiddenVariableSamplingPoints= 1075200
OrientationalSampling= 5.625 NrOrientations= 256
TranslationalSampling= 3.84 NrTranslations= 84
=============================
Gradient optimisation iteration 4 of 200 with 200 particles (Step size 0.9)
000/??? sec ~~(,_,"> oo (1536B) (512B) (1536B) (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [1024B] (512B) (1536B) (512B) (1536B) (512B) (1536B) [512B] (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (16384B) (16896B) (16384B) <16384B> [24729147392B] = 24729320448B
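The sampling figures printed in the log above can be cross-checked by hand, which helps confirm that the run parameters themselves are ordinary. The following is a minimal sketch in Python under two assumptions inferred only from the numbers in this log, not taken from the RELION source: that NrHiddenVariableSamplingPoints is the product of the class count (--K 50), NrOrientations and NrTranslations, and that the "requires orientationSampling of at least" value equals 360 / ceil(pi * diameter / resolution). The function names are hypothetical.

```python
import math


def hidden_variable_samples(n_classes, n_orientations, n_translations):
    """Assumed relation: one sample per (class, orientation, translation) triple."""
    return n_classes * n_orientations * n_translations


def required_angular_sampling(particle_diameter_A, resolution_A):
    """Assumed relation, inferred from the log: 360 / ceil(pi * D / resolution)."""
    return 360.0 / math.ceil(math.pi * particle_diameter_A / resolution_A)


K = 50  # value of --K in the command above

# Oversampling 0: 32 orientations, 21 translations
print(hidden_variable_samples(K, 32, 21))    # 33600
# Oversampling 1: 256 orientations, 84 translations
print(hidden_variable_samples(K, 256, 84))   # 1075200

# Particle diameter 150 A at the two resolutions reported in the log
print(required_angular_sampling(150, 61.44))   # 45.0
print(required_angular_sampling(150, 49.152))  # 36.0
```

All four values agree with the log, so nothing in the sampling setup looks unusual for a 64x64 pixel, 50-class run.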


biochem-fan · 2024-05-03T23:19:39Z

Does this happen with the latest version of RELION (5.0 beta 3)?

Gia1975 · 2024-05-06T10:02:54Z

Hi,

No, this is RELION 4.0.

Thanks,

GIA

biochem-fan · 2024-05-06T11:28:25Z

Did you test RELION 5.0?

We would like to focus bug fixes on RELION 5.0, because 5.0 is getting closer to stable release.

Gia1975 · 2024-05-16T17:08:44Z

Hi, I found out why: the user had installed CCP-EM, so the relion command was pointing to the RELION bundled with CCP-EM, which apparently does not work well. RELION 4 compiled from scratch works just fine. Thanks! GIA
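Since the cause was the wrong relion_refine being picked up first, it can help to list every copy on the search path before debugging anything else. The following is a minimal sketch in Python; find_all_on_path is a hypothetical helper written for this illustration, not part of RELION or CCP-EM, and it only inspects the PATH environment variable (shell aliases and module systems are not considered).

```python
import os


def find_all_on_path(name, path=None):
    """Return every executable file called `name` on the search path, in lookup order."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    matches = find_all_on_path("relion_refine")
    if not matches:
        print("relion_refine was not found on PATH")
    for rank, exe in enumerate(matches):
        # The first entry is the one a plain `relion_refine` call would run.
        marker = "used" if rank == 0 else "shadowed"
        print(f"{marker}: {exe} -> {os.path.realpath(exe)}")
```

If more than one path is printed, the first one is the binary the pipeline actually launches; the path in the error message (here it contains CCP-EM) is another hint about which build is running.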
