Problem: CUDA driver version is insufficient for CUDA runtime version. #4

Closed
AnnaLoveland opened this issue Sep 8, 2020 · 5 comments

@AnnaLoveland

Hi,
I am having trouble running deepEMhancer with either the suggested or the alternate installation. The program seems to be installed, because the -h option runs correctly, but when I try to run the calculation on the GPU I get an error. I'm attaching the error message and my CUDA specs.
Can't wait to see what deepEMhancer can do for my maps!
Thank you,
Anna

(deepEMhancer) [loveland]$ deepemhancer -i run_half1_class001_unfil.mrc -o firstry.mrc
updating environment to select gpu: [0]
Using TensorFlow backend.
loading model /home/exx/.local/share/deepEMhancerModels/production_checkpoints/deepEMhancer_tightTarget.hd5 ... Traceback (most recent call last):
  File "/home/exx/.conda/envs/deepEMhancer/bin/deepemhancer", line 11, in <module>
    sys.exit(commanLineFun())
  File "/home/exx/.conda/envs/deepEMhancer/lib/python3.6/site-packages/deepEMhancer/exeDeepEMhancer.py", line 80, in commanLineFun
    main( ** parseArgs() )
  File "/home/exx/.conda/envs/deepEMhancer/lib/python3.6/site-packages/deepEMhancer/exeDeepEMhancer.py", line 70, in main
    predictor= AutoProcessVol(checkpoint_fname, gpuIds= gpuIds, batch_size= batch_size)
  File "/home/exx/.conda/envs/deepEMhancer/lib/python3.6/site-packages/deepEMhancer/applyProcessVol/processVol.py", line 31, in __init__
    self.model = load_model(model_fname, nGpus=nGpus )
  File "/home/exx/.conda/envs/deepEMhancer/lib/python3.6/site-packages/deepEMhancer/utils/loadModel.py", line 19, in load_model
    model= load_model(checkpoint_fname, custom_objects=custom_objects )
  File "/home/exx/.conda/envs/deepEMhancer/lib/python3.6/site-packages/keras/engine/saving.py", line 419, in load_model
    model = _deserialize_model(f, custom_objects, compile)
  File "/home/exx/.conda/envs/deepEMhancer/lib/python3.6/site-packages/keras/engine/saving.py", line 287, in _deserialize_model
    K.batch_set_value(weight_value_tuples)
  File "/home/exx/.conda/envs/deepEMhancer/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2470, in batch_set_value
    get_session().run(assign_ops, feed_dict=feed_dict)
  File "/home/exx/.conda/envs/deepEMhancer/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 186, in get_session
    _SESSION = tf.Session(config=config)
  File "/home/exx/.conda/envs/deepEMhancer/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1570, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/home/exx/.conda/envs/deepEMhancer/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 693, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

(deepEMhancer_env) [loveland]$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

(deepEMhancer_env) [loveland]$ nvidia-smi
Tue Sep  8 08:15:15 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2080    On   | 00000000:3B:00.0  On |                  N/A |
| 24%   32C    P8    21W / 215W |    269MiB /  7951MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 2080    On   | 00000000:5E:00.0 Off |                  N/A |
| 22%   29C    P8     1W / 215W |      0MiB /  7952MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 2080    On   | 00000000:86:00.0 Off |                  N/A |
| 23%   31C    P8     2W / 215W |      0MiB /  7952MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 2080    On   | 00000000:D8:00.0 Off |                  N/A |
| 23%   32C    P8     7W / 215W |      0MiB /  7952MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     18806      G   /usr/bin/X                                   114MiB |
|    0     19714      G                                                 19MiB |
|    0     19994      G   /usr/bin/gnome-shell                         134MiB |
+-----------------------------------------------------------------------------+


@rsanchezgarc
Owner

Dear Anna,

The problem is caused by an incompatibility between your NVIDIA driver version (410.79) and CUDA 10.1, which deepEMhancer uses by default and which requires driver version >= 418.39.
deepEMhancer should also work with CUDA 10.0, so I have prepared an alternative installation file that should make it work with your setup.

In order to install it

  1. Remove the old installation
conda env remove -n deepEMhancer
  2. Update the repository
cd path/to/repository/deepEMhancer
git pull
  3. Create the environment using the new file
conda env create -f alternative_installation/deepEMhancer_cud10.0.env.yml -n deepEMhancer
  4. Install deepEMhancer
conda activate deepEMhancer
python -m pip install . --no-deps

Could you please try it and tell me if it works?

If it works, I will update the README to cover the different CUDA versions.
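
The driver/runtime mismatch described above can be checked before installing anything. The following is only a minimal sketch, not part of deepEMhancer: the helper names are hypothetical, and the minimum Linux driver versions in the table (410.48 for CUDA 10.0, 418.39 for CUDA 10.1) are taken from NVIDIA's release notes rather than from this thread.

```python
# Minimum Linux driver version for each CUDA toolkit discussed in this thread.
# Assumption: values come from NVIDIA's CUDA release notes, not from deepEMhancer.
MIN_DRIVER = {"10.0": (410, 48), "10.1": (418, 39)}


def parse_version(text):
    """Turn a dotted version string such as '410.79' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_supports(driver_version, cuda_version):
    """Return True if the driver is recent enough for the given CUDA runtime."""
    return parse_version(driver_version) >= MIN_DRIVER[cuda_version]


if __name__ == "__main__":
    # 410.79 is the driver version reported by nvidia-smi earlier in the thread.
    for cuda in ("10.0", "10.1"):
        verdict = "ok" if driver_supports("410.79", cuda) else "driver too old"
        print(f"driver 410.79, CUDA {cuda}: {verdict}")
```

With the driver from this report, the check passes for CUDA 10.0 and fails for CUDA 10.1, which matches the error message. The driver version itself can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.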

@AnnaLoveland
Author

Hi,
This changed things, but it did not finish the run. Here is the output:

(deepEMhancer) [loveland]$ deepemhancer -i run_half1_class001_unfil.mrc -o FirstTry.mrc
updating environment to select gpu: [0]
Using TensorFlow backend.
loading model /home/exx/.local/share/deepEMhancerModels/production_checkpoints/deepEMhancer_tightTarget.hd5 ... DONE!
Automatic radial noise detected beyond 42.0 % of volume side
DONE!. Shape at 1 A/voxel after padding->  (144, 144, 144)
Neural net inference
  3%|███▍                                                                                                                         | 1/36 [00:00<00:00, 26051.58it/s]2020-09-08 09:48:00.340335: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-09-08 09:48:00.356168: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
  3%|███▌                                                                                                                            | 1/36 [00:02<01:23,  2.39s/it]
Traceback (most recent call last):
  File "/usr/local/EMAN_2.21/envs/deepEMhancer/bin/deepemhancer", line 10, in <module>
    sys.exit(commanLineFun())
  File "/usr/local/EMAN_2.21/envs/deepEMhancer/lib/python3.6/site-packages/deepEMhancer/exeDeepEMhancer.py", line 80, in commanLineFun
    main( ** parseArgs() )
  File "/usr/local/EMAN_2.21/envs/deepEMhancer/lib/python3.6/site-packages/deepEMhancer/exeDeepEMhancer.py", line 73, in main
    voxel_size=boxSize, apply_postprocess_cleaning=cleaningStrengh)
  File "/usr/local/EMAN_2.21/envs/deepEMhancer/lib/python3.6/site-packages/deepEMhancer/applyProcessVol/processVol.py", line 186, in predict
    batch_y_pred= self.model.predict_on_batch(np.expand_dims(batch_x, axis=-1))
  File "/usr/local/EMAN_2.21/envs/deepEMhancer/lib/python3.6/site-packages/keras/engine/training.py", line 1274, in predict_on_batch
    outputs = self.predict_function(ins)
  File "/usr/local/EMAN_2.21/envs/deepEMhancer/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/usr/local/EMAN_2.21/envs/deepEMhancer/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/usr/local/EMAN_2.21/envs/deepEMhancer/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1458, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[{{node conv3d_1/convolution}}]]
	 [[activation_10/Identity/_609]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
	 [[{{node conv3d_1/convolution}}]]
0 successful operations.
0 derived errors ignored.

@rsanchezgarc
Owner

rsanchezgarc commented Sep 8, 2020

Hi Anna,

Well, at least we have solved the CUDA problem... I think the new problem is caused by an incompatibility between CUDA 10.0 and the cuDNN version that Anaconda installed. There are many issues about this in the TensorFlow GitHub repository, e.g. tensorflow/tensorflow#24496

Could you please check which versions are installed?

Within the environment, could you execute

conda list 

and paste the result?

Apart from that, could you try running deepEMhancer with the environment variable TF_FORCE_GPU_ALLOW_GROWTH='true' set?

E.g.

TF_FORCE_GPU_ALLOW_GROWTH='true' deepemhancer -i ~/tmp/useCasesDeepVol/EMD-0193.mrc -o ~/tmp/outVolDeepEMhancer/out.mrc
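
When deepemhancer is launched from a script rather than typed in a shell, the same variable can be passed through the child process environment. This is a minimal sketch under that assumption; `gpu_growth_env` and `run_with_gpu_growth` are hypothetical helper names, not deepEMhancer functions.

```python
import os
import subprocess
import sys


def gpu_growth_env(base_env=None):
    """Copy an environment and ask TensorFlow to allocate GPU memory on demand."""
    env = dict(os.environ if base_env is None else base_env)
    env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return env


def run_with_gpu_growth(cmd):
    """Run a command (a list of arguments) with the variable set for that process only."""
    return subprocess.run(cmd, env=gpu_growth_env(), check=False)


if __name__ == "__main__":
    # Harmless probe: a child Python process prints the value it received.
    probe = [sys.executable, "-c",
             "import os; print(os.environ['TF_FORCE_GPU_ALLOW_GROWTH'])"]
    run_with_gpu_growth(probe)
    # For the real run, replace the probe with e.g.
    # ["deepemhancer", "-i", "input.mrc", "-o", "output.mrc"]
```

The variable has to be in the environment before TensorFlow initializes the GPU, which is why it is set for the child process rather than changed after the program has started.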

Many thanks

Ruben

@rsanchezgarc rsanchezgarc reopened this Sep 8, 2020
@AnnaLoveland
Author

Hi Ruben,
Setting TF_FORCE_GPU_ALLOW_GROWTH='true' allows the program to run. One out-of-memory error occurs, but the program seems to recover and outputs a volume. That map looks quite good, but is it the best possible map given the error?

(deepEMhancer) [root@c105491 Relion-JANNI]# TF_FORCE_GPU_ALLOW_GROWTH='true' deepemhancer -i run_half1_class001_unfil.mrc -o FirstTry.mrc
updating environment to select gpu: [0]
Using TensorFlow backend.
loading model /root/.local/share/deepEMhancerModels/production_checkpoints/deepEMhancer_tightTarget.hd5 ... DONE!
Automatic radial noise detected beyond 42.0 % of volume side
DONE!. Shape at 1 A/voxel after padding->  (144, 144, 144)
Neural net inference
  3%|██▎                                                                                | 1/36 [00:00<00:00, 45100.04it/s]2020-09-08 14:20:32.902403: E tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 2.99G (3214478336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
100%|█████████████████████████████████████████████████████████████████████████████████████| 36/36 [00:53<00:00,  1.50s/it]

See the conda list output below.


(deepEMhancer) [loveland]$ conda list
# packages in environment at /usr/local/EMAN_2.21/envs/deepEMhancer:
#
# Name                    Version                   Build  Channel
_tflow_select             2.1.0                       gpu    anaconda
absl-py                   0.9.0                    py36_0    anaconda
astor                     0.8.1                    py36_0    anaconda
blas                      1.0                         mkl    anaconda
brotlipy                  0.7.0           py36h7b6447c_1000    anaconda
c-ares                    1.15.0            h7b6447c_1001    anaconda
ca-certificates           2020.7.22                     0    anaconda
certifi                   2020.6.20                py36_0    anaconda
cffi                      1.14.2           py36he30daa8_0    anaconda
chardet                   3.0.4                 py36_1003    anaconda
cloudpickle               1.6.0                      py_0    anaconda
cryptography              3.1              py36h1ba5d50_0    anaconda
cudatoolkit               10.0.130                      0    anaconda
cudnn                     7.6.5                cuda10.0_0    anaconda
cupti                     10.0.130                      0    anaconda
cycler                    0.10.0                   py36_0    anaconda
cytoolz                   0.10.1           py36h7b6447c_0    anaconda
dask-core                 2.25.0                     py_0    anaconda
dbus                      1.13.12              h746ee38_0    anaconda
decorator                 4.4.2                      py_0    anaconda
deepemhancer              0.13                     pypi_0    pypi
expat                     2.2.9                he6710b0_2    anaconda
fontconfig                2.13.0               h9420a91_0    anaconda
freetype                  2.10.2               h5ab3b9f_0    anaconda
gast                      0.4.0                      py_0    anaconda
glib                      2.56.2               hd408876_0    anaconda
google-pasta              0.2.0                      py_0    anaconda
grpcio                    1.31.0           py36hf8bcb03_0    anaconda
gst-plugins-base          1.14.0               hbbd80ab_1    anaconda
gstreamer                 1.14.0               hb453b48_1    anaconda
h5py                      2.9.0            py36h7918eee_0    anaconda
hdf5                      1.10.4               hb1b8bf9_0    anaconda
icu                       58.2                 he6710b0_3    anaconda
idna                      2.8                      py36_0    anaconda
imageio                   2.9.0                      py_0    anaconda
importlib-metadata        1.7.0                    py36_0    anaconda
intel-openmp              2020.2                      254    anaconda
joblib                    0.13.2                   py36_0    anaconda
jpeg                      9b                   habf39ab_1    anaconda
keras                     2.2.4                         0    anaconda
keras-applications        1.0.8                      py_1    anaconda
keras-base                2.2.4                    py36_0    anaconda
keras-contrib             2.0.8                    pypi_0    pypi
keras-preprocessing       1.1.0                      py_1    anaconda
keras-radam               0.12.0                   pypi_0    pypi
kiwisolver                1.2.0            py36hfd86e86_0    anaconda
lcms2                     2.11                 h396b838_0    anaconda
ld_impl_linux-64          2.33.1               h53a641e_7    anaconda
libedit                   3.1.20191231         h14c3975_1    anaconda
libffi                    3.3                  he6710b0_2    anaconda
libgcc-ng                 9.1.0                hdf63c60_0    anaconda
libgfortran-ng            7.3.0                hdf63c60_0    anaconda
libpng                    1.6.37               hbc83047_0    anaconda
libprotobuf               3.12.4               hd408876_0    anaconda
libstdcxx-ng              9.1.0                hdf63c60_0    anaconda
libtiff                   4.1.0                h2733197_1    anaconda
libuuid                   1.0.3                h1bed415_2    anaconda
libxcb                    1.14                 h7b6447c_0    anaconda
libxml2                   2.9.10               he19cac6_1    anaconda
llvmlite                  0.29.0           py36hd408876_0    anaconda
lz4-c                     1.9.2                he6710b0_1    anaconda
markdown                  3.2.2                    py36_0    anaconda
matplotlib                3.3.1                         0    anaconda
matplotlib-base           3.3.1            py36h817c723_0    anaconda
mkl                       2019.4                      243    anaconda
mkl-service               2.3.0            py36he904b0f_0    anaconda
mkl_fft                   1.1.0            py36h23d657b_0    anaconda
mkl_random                1.1.0            py36hd6b4f25_0    anaconda
mrcfile                   1.1.2                    pypi_0    pypi
ncurses                   6.2                  he6710b0_1    anaconda
networkx                  2.5                        py_0    anaconda
numba                     0.45.1           py36h962f231_0    anaconda
numpy                     1.16.6           py36hbc911f0_0    anaconda
numpy-base                1.16.6           py36hde5b4d6_0    anaconda
olefile                   0.46                     py36_0    anaconda
openssl                   1.1.1g               h7b6447c_0    anaconda
pandas                    0.25.3           py36he6710b0_0    anaconda
pcre                      8.44                 he6710b0_0    anaconda
pillow                    7.2.0            py36hb39fc2d_0    anaconda
pip                       19.2.2                   py36_0    anaconda
protobuf                  3.12.4           py36he6710b0_0    anaconda
pycparser                 2.20                       py_2    anaconda
pyopenssl                 19.1.0                     py_1    anaconda
pyparsing                 2.4.7                      py_0    anaconda
pyqt                      5.9.2            py36h22d08a2_1    anaconda
pysocks                   1.7.1                    py36_0    anaconda
python                    3.6.10               h7579374_2    anaconda
python-dateutil           2.8.1                      py_0    anaconda
pytz                      2020.1                     py_0    anaconda
pywavelets                1.1.1            py36h7b6447c_0    anaconda
pyyaml                    5.3.1            py36h7b6447c_1    anaconda
qt                        5.9.7                h5867ecd_1    anaconda
readline                  8.0                  h7b6447c_0    anaconda
requests                  2.22.0                   py36_1    anaconda
scikit-image              0.15.0           py36he6710b0_0    anaconda
scipy                     1.3.1            py36h7c811a0_0    anaconda
setuptools                49.6.0                   py36_0    anaconda
sip                       4.19.24          py36he6710b0_0    anaconda
six                       1.15.0                     py_0    anaconda
sqlite                    3.33.0               h62c20be_0    anaconda
tensorboard               1.14.0           py36hf484d3e_0    anaconda
tensorflow                1.14.0          gpu_py36h57aa796_0    anaconda
tensorflow-base           1.14.0          gpu_py36h8d69cac_0    anaconda
tensorflow-estimator      1.14.0                     py_0    anaconda
tensorflow-gpu            1.14.0               h0d30ee6_0    anaconda
termcolor                 1.1.0                    py36_1    anaconda
tk                        8.6.10               hbc83047_0    anaconda
toolz                     0.10.0                     py_0    anaconda
tornado                   6.0.4            py36h7b6447c_1    anaconda
tqdm                      4.42.1                     py_0    anaconda
urllib3                   1.25.10                    py_0    anaconda
werkzeug                  1.0.1                      py_0    anaconda
wheel                     0.35.1                     py_0    anaconda
wrapt                     1.12.1           py36h7b6447c_1    anaconda
xz                        5.2.5                h7b6447c_0    anaconda
yaml                      0.2.5                h7b6447c_0    anaconda
zipp                      3.1.0                      py_0    anaconda
zlib                      1.2.11               h7b6447c_3    anaconda
zstd                      1.4.4                h0b5b093_3    anaconda
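
For a listing as long as the one above, the entries relevant to the CUDA/cuDNN question can be picked out programmatically. The following is a minimal sketch that assumes the whitespace-separated column layout shown here (name, version, build, channel); the function names and the choice of packages in `GPU_STACK` are illustrative, not part of conda or deepEMhancer.

```python
# Packages that matter when diagnosing CUDA/cuDNN problems (illustrative choice).
GPU_STACK = ("cudatoolkit", "cudnn", "tensorflow", "tensorflow-gpu", "keras")


def parse_conda_list(text):
    """Map package name -> version from the text output of `conda list`."""
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header comments
        fields = line.split()
        if len(fields) >= 2:
            versions[fields[0]] = fields[1]
    return versions


def gpu_stack_versions(text):
    """Keep only the GPU-related packages that are present in the listing."""
    versions = parse_conda_list(text)
    return {name: versions[name] for name in GPU_STACK if name in versions}


# A few lines copied from the listing in this thread.
SAMPLE = """\
# Name                    Version                   Build  Channel
cudatoolkit               10.0.130                      0    anaconda
cudnn                     7.6.5                cuda10.0_0    anaconda
keras                     2.2.4                         0    anaconda
tensorflow                1.14.0          gpu_py36h57aa796_0    anaconda
"""

if __name__ == "__main__":
    for name, version in gpu_stack_versions(SAMPLE).items():
        print(f"{name:<15} {version}")
```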


@rsanchezgarc
Owner

rsanchezgarc commented Sep 8, 2020

Hi Anna,

I do not think the error affects the quality of the results. Still, if you want to get rid of it, try a smaller batch size, e.g. --batch_size 4 or even --batch_size 2.

I would also recommend exploring the different normalization and -p options, which give different results, but that is a separate topic.
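
If the out-of-memory message persists, the smaller batch sizes suggested above can be tried in turn. This is only a sketch of how such command lines could be generated: the helper names and the starting batch size of 8 are hypothetical, and only the -i, -o, and --batch_size options that appear in this thread are used.

```python
def build_command(input_map, output_map, batch_size):
    """Build a deepemhancer command line as an argument list."""
    return ["deepemhancer", "-i", input_map, "-o", output_map,
            "--batch_size", str(batch_size)]


def halving_batch_sizes(start, minimum=1):
    """Return start, start // 2, ... down to (but not below) minimum."""
    if minimum < 1:
        raise ValueError("minimum must be at least 1")
    sizes = []
    size = start
    while size >= minimum:
        sizes.append(size)
        size //= 2
    return sizes


if __name__ == "__main__":
    # Starting value of 8 is an assumption made for illustration only.
    for batch_size in halving_batch_sizes(8, minimum=2):
        cmd = build_command("run_half1_class001_unfil.mrc", "FirstTry.mrc", batch_size)
        print(" ".join(cmd))
```

Each printed line can be run by hand, stopping at the first batch size that completes without the allocation error.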
