failed to create cufft plan #1080

Open
walidabualafia opened this issue Feb 6, 2024 · 3 comments

@walidabualafia

I have been using relion/5.0-beta for a while now. I have a user who ran a very long job, which exited with a "failed to create cufft plan" error. I am not sure what is causing this issue. Most functionality and behavior is correct, and this error only came up while the user was running RELION.

Environment:

  • OS: RHEL8
  • MPI runtime: OpenMPI 4.1.5
  • RELION version: 5.0-beta-0-commit-da8ee2
  • Memory: 480 GB
  • GPU: Nvidia A100 x4

Dataset:

  • Box size: 720 px
  • Pixel size: 0.6485 Å/px
  • Number of particles: 1011456
  • Description: 80S ribosome

Job options:

  • Type of job: Refine3D
  • Number of MPI processes: 5
  • Number of threads: 60 (total)
  • Full command (see note.txt in the job directory):
     `which relion_refine_mpi` --o Refine3D/job201/run --auto_refine --split_random_halves --i Polish/job198/shiny.star --ref Refine3D/job185/run_class001.mrc --ini_high 30 --dont_combine_weights_via_disc --scratch_dir /scratch_space/zwatson --pool 64 --pad 2  --ctf --particle_diameter 340 --flatten_solvent --zero_mask --solvent_mask job066-60S-mask.mrc --solvent_correct_fsc  --oversampling 1 --healpix_order 4 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale  --j 12 --gpu "" --keep_scratch --pipeline_control Refine3D/job201/
    

Error message:

--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   nodegpu214
  Local device: mlx5_0
--------------------------------------------------------------------------
[nodegpu214:77459] 4 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[nodegpu214:77459] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
in: /path/to/relion/vendor/relion/src/projector.cpp, line 362
ERROR:
failed to create cufft plan
in: /path/to/relion/vendor/relion/src/projector.cpp, line 362
ERROR:
failed to create cufft plan
in: /path/to/relion/vendor/relion/src/projector.cpp, line 362
ERROR:
failed to create cufft plan
in: /path/to/relion/vendor/relion/src/projector.cpp, line 362
ERROR:
failed to create cufft plan
=== Backtrace  ===
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x69) [0x4ba969]
/path/to/relion/install/5.0/bin/relion_refine_mpi() [0x44bdd5]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN7MlModel23setFourierTransformMapsEbifPK13MultidimArrayIfE+0x8c3) [0x667033]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN11MlOptimiser16expectationSetupEv+0x5c) [0x684c2c]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x3e4) [0x4dbd64]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xd8) [0x4ed0f8]
/path/to/relion/install/5.0/bin/relion_refine_mpi(main+0x56) [0x4a9616]
/lib64/libc.so.6(__libc_start_main+0xf3) [0x155537b45493]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_start+0x2e) [0x4acb7e]
==================
ERROR:
failed to create cufft plan
=== Backtrace  ===
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x69) [0x4ba969]
/path/to/relion/install/5.0/bin/relion_refine_mpi() [0x44bdd5]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN7MlModel23setFourierTransformMapsEbifPK13MultidimArrayIfE+0x8c3) [0x667033]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN11MlOptimiser16expectationSetupEv+0x5c) [0x684c2c]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x3e4) [0x4dbd64]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xd8) [0x4ed0f8]
/path/to/relion/install/5.0/bin/relion_refine_mpi(main+0x56) [0x4a9616]
/lib64/libc.so.6(__libc_start_main+0xf3) [0x155537b45493]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_start+0x2e) [0x4acb7e]
==================
ERROR:
failed to create cufft plan
=== Backtrace  ===
=== Backtrace  ===
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x69) [0x4ba969]
/path/to/relion/install/5.0/bin/relion_refine_mpi() [0x44bdd5]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN7MlModel23setFourierTransformMapsEbifPK13MultidimArrayIfE+0x8c3) [0x667033]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN11MlOptimiser16expectationSetupEv+0x5c) [0x684c2c]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x3e4) [0x4dbd64]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xd8) [0x4ed0f8]
/path/to/relion/install/5.0/bin/relion_refine_mpi(main+0x56) [0x4a9616]
/lib64/libc.so.6(__libc_start_main+0xf3) [0x155537b45493]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_start+0x2e) [0x4acb7e]
==================
ERROR:
failed to create cufft plan
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 4 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x69) [0x4ba969]
/path/to/relion/install/5.0/bin/relion_refine_mpi() [0x44bdd5]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN7MlModel23setFourierTransformMapsEbifPK13MultidimArrayIfE+0x8c3) [0x667033]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN11MlOptimiser16expectationSetupEv+0x5c) [0x684c2c]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x3e4) [0x4dbd64]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xd8) [0x4ed0f8]
/path/to/relion/install/5.0/bin/relion_refine_mpi(main+0x56) [0x4a9616]
/lib64/libc.so.6(__libc_start_main+0xf3) [0x155537b45493]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_start+0x2e) [0x4acb7e]
==================
ERROR:
failed to create cufft plan
[nodegpu214:77459] 3 more processes have sent help message help-mpi-api.txt / mpi-abort
@biochem-fan
Member

I have a user who ran a very long job, which exited with a "failed to create cufft plan" error. I am not sure what is causing this issue. Most functionality and behavior is correct, and this error only came up while the user was running RELION.

Does this always happen for this particular user? What happens if the user continues the failed job?

Box size: 720 px
Pixel size: 0.6485 Å/px

Unless the resolution is near 1.3 Å, down-sample the particles. At this pixel size you are wasting storage and processing power.
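
The 1.3 Å figure follows from the Nyquist limit, which is twice the pixel size (2 × 0.6485 Å ≈ 1.297 Å). As a rough sketch of the arithmetic, the Python below computes that limit and the smallest even box that covers the same physical extent for a chosen target resolution. The function names and the 2.0 Å target are hypothetical and used only for illustration; they are not taken from RELION or from this report.

```python
import math


def nyquist_limit(pixel_size_A: float) -> float:
    """Finest resolution (in Å) that a given pixel size can represent."""
    return 2.0 * pixel_size_A


def downsampled_box(box_px: int, pixel_size_A: float, target_res_A: float):
    """Return (new_box, new_pixel_size_A) for the smallest even box that
    covers the same physical extent and whose Nyquist limit still reaches
    target_res_A. The box is never made larger than the input box."""
    extent_A = box_px * pixel_size_A      # physical width of the box
    max_pixel_A = target_res_A / 2.0      # coarsest usable pixel size
    new_box = math.ceil(extent_A / max_pixel_A)
    if new_box % 2:                       # keep the box size even
        new_box += 1
    new_box = min(new_box, box_px)
    return new_box, extent_A / new_box


if __name__ == "__main__":
    # Box and pixel size come from the report; 2.0 Å is an assumed target.
    print("Nyquist limit (Å):", nyquist_limit(0.6485))
    print("Box and pixel size for 2.0 Å:", downsampled_box(720, 0.6485, 2.0))
```

With the values from the report and the assumed 2.0 Å target, this gives a 468 px box at roughly 0.998 Å/px instead of 720 px at 0.6485 Å/px.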

@walidabualafia
Author

I have not had any other users report this issue. I also asked around, and no one else has seen it.

This user encountered the error on 7 different jobs, which do not all contain the same particles. Whenever she hit the error, her batch job would be preempted and exit. I'm not sure whether she is able to continue the job. She did not encounter the error when she ran her job with version 4.0.1-commit-7809a7.

@biochem-fan
Member

Considering that the A100 has a large amount of VRAM, it is not very likely that the program ran out of memory. Nonetheless, it is worth trying down-sampled particles. I am sure the user does not need 0.6485 Å/px. With a more reasonable pixel size, the box size would be smaller, using less memory and leading to faster processing.
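
To put rough numbers on the memory argument, here is a back-of-envelope sketch in Python. It assumes that the transform works on a cubic single-precision volume of side pad × box (using `--pad 2` from the command above) and that a real-to-complex FFT stores a half-complex output. Whether RELION allocates exactly these buffers at the failing line is not confirmed anywhere in this thread, and any extra cuFFT workspace is ignored. The function name is hypothetical.

```python
def padded_fft_bytes(box_px: int, pad: int = 2, bytes_per_value: int = 4):
    """Return (real_bytes, complex_bytes) for a cubic volume of side
    pad * box_px under the assumptions stated above."""
    side = pad * box_px
    # Real-space input: side^3 single-precision values.
    real_bytes = side ** 3 * bytes_per_value
    # Half-complex output of a real-to-complex 3D FFT:
    # side * side * (side // 2 + 1) complex values, two floats each.
    complex_bytes = side * side * (side // 2 + 1) * 2 * bytes_per_value
    return real_bytes, complex_bytes


if __name__ == "__main__":
    GIB = 1024 ** 3
    # 720 px is the reported box; 468 px is an assumed down-sampled box.
    for box in (720, 468):
        real_b, cplx_b = padded_fft_bytes(box)
        print(f"box {box}: real {real_b / GIB:.2f} GiB, "
              f"complex {cplx_b / GIB:.2f} GiB")
```

Under these assumptions the 720 px box needs roughly 12 GB for each of the two buffers, and a 468 px box would need about 27% of that, since the volume scales with the cube of the box size.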
