difficulty running/loading model on GPU #16

Open
murakdar opened this issue Jul 2, 2019 · 10 comments

@murakdar

murakdar commented Jul 2, 2019

I have been trying to predict the structure of a new sequence using the available pre-trained model (CASP11), but I've so far been unsuccessful in running the model. Note that I was equally unsuccessful in training a new model, with errors similar to those below, but I will frame this in the context of the prediction task.

First, I successfully followed the input preparation steps provided in the README (i.e. using HMMER and convert scripts). Then, I slightly modified the configuration file to locate the .tfrecord files to be tested. From inside the rgn directory, I ran python model/protling.py ../models/RGN12/runs/CASP12/ProteinNet12Thinning90/configuration-test -d ../models/RGN12 -p -e weighted_testing.

The resulting error is:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams' with these attrs.  Registered devices: [CPU,XLA_CPU,XLA_GPU], Registered kernels:
  device='GPU'; T in [DT_DOUBLE]
  device='GPU'; T in [DT_FLOAT]
  device='GPU'; T in [DT_HALF]

A complete log file is found at the end of this message. Training a new model based on the ProteinNet data sets also doesn't work for me, with a similar error. I suspect the underlying culprit is the following line:

2019-07-02 21:39:54.085506: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
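The cuInit failure and the OpKernel error are consistent with each other: the op only has GPU kernels, while the failing process registered no plain GPU device. A minimal sketch that pulls both lists out of the error text, using only the standard library (the helper names `registered_devices` and `kernel_devices` are made up for illustration and are not part of TensorFlow or RGN):

```python
import re


def registered_devices(message):
    """Return the device types listed after 'Registered devices:' in the message."""
    match = re.search(r"Registered devices:\s*\[([^\]]*)\]", message)
    if match is None:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


def kernel_devices(message):
    """Return the device types named in the device='...' kernel lines."""
    return sorted(set(re.findall(r"device='([A-Z_]+)'", message)))


# Error text copied from the report above.
error_text = """No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams' with these attrs.  Registered devices: [CPU,XLA_CPU,XLA_GPU], Registered kernels:
  device='GPU'; T in [DT_DOUBLE]
  device='GPU'; T in [DT_FLOAT]
  device='GPU'; T in [DT_HALF]
"""

available = registered_devices(error_text)
needed = kernel_devices(error_text)
# Exact name comparison: 'XLA_GPU' does not count as 'GPU'.
missing = [device for device in needed if device not in available]

print("available: %s" % available)
print("kernels exist for: %s" % needed)
print("missing: %s" % missing)
```

For this message the only missing device type is GPU, which matches the "no CUDA-capable device is detected" line.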

However, I know that the machine does have a working GPU on which other applications can run. For example, the command python -c 'import tensorflow as tf; sess = tf.Session(); devices = sess.list_devices(); print(devices)' works as expected; the resulting output is:

2019-07-02 21:51:19.765631: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-07-02 21:51:19.923013: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-02 21:51:19.923728: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:04.0
totalMemory: 14.73GiB freeMemory: 14.52GiB
2019-07-02 21:51:19.923765: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-07-02 21:51:20.365814: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-02 21:51:20.365876: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-07-02 21:51:20.365895: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-07-02 21:51:20.366064: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14047 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5)
[_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456, 3962879756071663290), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:0, XLA_GPU, 17179869184, 3582480176640480454), _DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 5594773058756615672), _DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:0, GPU, 14730090906, 11541728406927441233)]

I am using TensorFlow 1.12.0 with CUDA 9.0 on Python 2.7.12. Trying with or without export CUDA_VISIBLE_DEVICES=0 had no effect. I'd be happy to provide any additional information that could be useful.
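Since the standalone tf.Session() check sees the GPU while the protling.py run does not, it may be worth checking what CUDA_VISIBLE_DEVICES actually holds inside the failing process at the moment the session is created. An empty value hides every GPU from CUDA, which is different from the variable being unset. The sketch below only inspects the environment; the function name is hypothetical, and whether anything in the RGN code changes the variable is an assumption that has not been verified here:

```python
import os


def describe_cuda_visible_devices(environ=None):
    """Summarize what CUDA_VISIBLE_DEVICES would expose to a CUDA process."""
    environ = os.environ if environ is None else environ
    value = environ.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return "unset: all GPUs are visible"
    if value.strip() == "":
        return "empty: no GPUs are visible"
    return "set to %r: only the listed devices are visible" % value


# Printing this just before the TensorFlow session is created shows
# the value the process really sees, regardless of what the shell exported.
print(describe_cuda_visible_devices())
```

If the value differs from the one exported in the shell, something between the shell and the session creation is overriding it.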

Finally, I'm not sure if it's relevant to this particular issue, but I was also unable to successfully run python tests.py (from within rgn/model). (This is after extracting tests_data.zip and adjusting base_dir on line 20 accordingly.) After some deprecation warnings, here is the output from the first two unit tests:

======================================================================
ERROR: testBidirectionalCudnnLSTM (__main__.CanonicalTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests.py", line 1591, in testBidirectionalCudnnLSTM
    rtol=1e-4, atol=1e-4, use_gpu=True, restart_every_iteration=True)
  File "tests.py", line 223, in _testCore
    m_train.finish(sess, save=True, close_session=False, reset_graph=False)
  File "/home/dariusz/structure/aqlaboratory/rgn/model/model.py", line 492, in _finish
    self._coordinator.join(self._threads)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/training/queue_runner_impl.py", line 257, in _run
    enqueue_callable()
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1267, in _single_tensor_run
    results = self._call_tf_sessionrun(None, {}, fetch_list, [], None)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
NotFoundError: baseDirectory/data/CASP11Thinning30TwoResidueShiftEvoUniParcBakerJackHMMERNeg10JackHMMERNeg10/training/full/1; No such file or directory
	 [[{{node RGN/model_0/read_protein/ReaderReadV2}} = ReaderReadV2[_device="/job:localhost/replica:0/task:0/device:CPU:0"](RGN/model_0/read_protein/TFRecordReaderV2, RGN/model_0/file_queue)]]
	 [[{{node RGN/model_0/batching_queue/cond/padding_fifo_queue_enqueue/_36}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_107_RGN/model_0/batching_queue/cond/padding_fifo_queue_enqueue", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

======================================================================
ERROR: testBidirectionality (__main__.CanonicalTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests.py", line 506, in testBidirectionality
    [[9918.58933468,  8952.59069162,  9176.94079926,  8796.51937957, 12218.92350567,  10755.31931812,   9559.9827963 ,   5893.13110397, 5506.3973903 ,   7582.1031883 ,  10850.59082285,  11665.04905976, 10217.72346162,   8608.70925565,   4039.71197761,   9195.48430789, 12097.81036358,   9139.1117249 ,   7955.98830914,   7179.4971963 , 5227.11424296,   7736.59951981,  10184.379717  ,   7659.47643575, 8075.85901917,   2743.33191322]]]})
  File "tests.py", line 230, in _testCore
    m_train, m_evals = self._createModel(c_train, c_evals)
  File "tests.py", line 147, in _createModel
    m_train = RGNModel('training', c_train)
  File "/home/dariusz/structure/aqlaboratory/rgn/model/model.py", line 117, in __init__
    raise RuntimeError('Model already started; cannot create new objects.')
RuntimeError: Model already started; cannot create new objects.

The remaining tests all raise the same RuntimeError: Model already started; cannot create new objects. Moreover, running an individual test doesn't seem to produce any useful output:

$ python tests.py CanonicalTest.testBidirectionality
ERROR:tensorflow:Starting: testBidirectionality
<...snipped warnings...>
ERROR:tensorflow:Finished: testBidirectionality
.
----------------------------------------------------------------------
Ran 1 test in 7.717s

OK
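One plausible reading of this pattern is that only the first failure (the NotFoundError) is a real one, and that the later RuntimeErrors are a cascade: the first test dies before its cleanup runs, so some shared "started" state is never reset. The toy below illustrates that mechanism only. `ToyModel` and `DataNotFound` are hypothetical stand-ins, not the RGN code, and the assumption that RGNModel tracks its state this way has not been checked against model.py:

```python
class DataNotFound(Exception):
    """Stand-in for the file-not-found error raised in the first test."""


class ToyModel(object):
    """Toy stand-in, NOT the RGN code: a flag shared by all instances."""
    _is_started = False

    def __init__(self):
        if ToyModel._is_started:
            raise RuntimeError('Model already started; cannot create new objects.')

    def start(self):
        ToyModel._is_started = True

    def finish(self, fail=False):
        if fail:
            # The error is raised before the flag is reset below.
            raise DataNotFound('No such file or directory')
        ToyModel._is_started = False


def run_test(fail_in_finish):
    """Return the name of the exception a test hits, or 'OK'."""
    try:
        model = ToyModel()
        model.start()
        model.finish(fail=fail_in_finish)
        return 'OK'
    except Exception as exc:
        return type(exc).__name__


# Three tests in one process: the first fails, the others inherit its state.
results = [run_test(True), run_test(False), run_test(False)]
print(results)

# A fresh process starts with a clean flag, so a single test can pass.
ToyModel._is_started = False
print(run_test(False))
```

Under this reading, fixing the data path that the first test cannot find would be the thing to try before looking at the remaining errors.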

Here is the complete output log file located in ../models/RGN12/logs/CASP12.log:

WARNING:tensorflow:From /home/dariusz/structure/aqlaboratory/rgn/model/model.py:543: string_input_producer (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.from_tensor_slices(string_tensor).shuffle(tf.shape(input_tensor, out_type=tf.int64)[0]).repeat(num_epochs)`. If `shuffle=False`, omit the `.shuffle(...)`.
WARNING:tensorflow:From /home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/training/input.py:276: input_producer (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.from_tensor_slices(input_tensor).shuffle(tf.shape(input_tensor, out_type=tf.int64)[0]).repeat(num_epochs)`. If `shuffle=False`, omit the `.shuffle(...)`.
WARNING:tensorflow:From /home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/training/input.py:188: limit_epochs (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.from_tensors(tensor).repeat(num_epochs)`.
WARNING:tensorflow:From /home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/training/input.py:197: __init__ (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
WARNING:tensorflow:From /home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/training/input.py:197: add_queue_runner (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.
WARNING:tensorflow:From /home/dariusz/structure/aqlaboratory/rgn/model/net_ops.py:115: __init__ (from tensorflow.python.ops.io_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.TFRecordDataset`.
WARNING:tensorflow:From /home/dariusz/structure/aqlaboratory/rgn/model/model.py:575: maybe_batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.filter(...).batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).
WARNING:tensorflow:From /home/dariusz/structure/aqlaboratory/rgn/model/net_ops.py:204: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
WARNING:tensorflow:From /home/dariusz/structure/aqlaboratory/rgn/model/geom_ops.py:98: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
*** training configuration ***
{'architecture': {'all_to_all_peepholes': False,
                  'all_to_recurrent_skip_connections': False,
                  'alphabet_size': 60,
                  'alphabet_trainable': True,
                  'bidirectional': True,
                  'first_residual_connection_from_nth_layer': 1,
                  'higher_order_layers': True,
                  'include_dihedrals_between_layers': False,
                  'include_evolutionary': True,
                  'include_primary': True,
                  'include_recurrent_outputs_between_layers': True,
                  'input_to_recurrent_skip_connections': False,
                  'recurrent_layer_size': [800, 800],
                  'recurrent_nonlinear_out_proj_function': 'tanh',
                  'recurrent_nonlinear_out_proj_size': None,
                  'recurrent_peepholes': True,
                  'recurrent_to_output_skip_connections': False,
                  'recurrent_unit': 'CudnnLSTM',
                  'residual_connections_every_n_layers': None,
                  'tertiary_output': 'linear_alphabet'},
 'computing': {'allow_gpu_growth': False,
               'default_device': '',
               'fill_gpu': False,
               'functions_on_devices': {'/cpu:0': ['point_to_coordinate']},
               'gpu_fraction': 1.0,
               'num_cpus': 4,
               'num_reconstruction_fragments': 6,
               'num_reconstruction_parallel_iters': 4,
               'num_recurrent_parallel_iters': 1,
               'num_recurrent_shards': 1},
 'curriculum': {'base': 100.0,
                'behavior': None,
                'change_num_iterations': 5,
                'loss_history_subgroup': 'all',
                'mode': None,
                'rate': 0.002,
                'sharpness': 20.0,
                'slope': 1.0,
                'threshold': 5.0,
                'update_loss_history': False},
 'initialization': {'alphabet_init': {'dist': 'uniform', 'range': 3.14159},
                    'alphabet_seed': None,
                    'angle_shift': [0.0, 0.0, 0.0],
                    'dropout_seed': None,
                    'evolutionary_multiplier': 1.0,
                    'graph_seed': 426,
                    'queue_seed': None,
                    'recurrent_forget_bias': 1.0,
                    'recurrent_init': {'base': {'dist': 'uniform',
                                                'range': 0.01},
                                       'bias': {'dist': 'uniform',
                                                'range': 0}},
                    'recurrent_nonlinear_out_proj_init': {'base': {},
                                                          'bias': {}},
                    'recurrent_nonlinear_out_proj_seed': None,
                    'recurrent_out_proj_init': {'base': {'dist': 'uniform',
                                                         'range': 0.01},
                                                'bias': {'dist': 'uniform',
                                                         'range': 0}},
                    'recurrent_out_proj_seed': None,
                    'recurrent_seed': None,
                    'zoneout_seed': None},
 'io': {'alphabet_file': None,
        'checkpoint_every_n_hours': 24,
        'checkpoints_directory': '../models/RGN12/runs/CASP12/ProteinNet12Thinning90/checkpoints/',
        'data_files': None,
        'data_files_glob': '../models/RGN12/data/ProteinNet12Thinning90/training/[!a-z]*',
        'detailed_logs': True,
        'evaluation_sub_groups': ['10', '20', '30', '40', '50', '70', '90'],
        'log_alphabet': True,
        'log_model_summaries': True,
        'logs_directory': '../models/RGN12/runs/CASP12/ProteinNet12Thinning90/logs/',
        'max_checkpoints': None,
        'name': 'training',
        'num_edge_residues': 0,
        'num_evo_entries': 42},
 'loss': {'atoms': 'c_alpha',
          'batch_dependent_normalization': True,
          'include': True,
          'tertiary_normalization': 'first',
          'tertiary_weight': 1.0},
 'optimization': {'alphabet_temperature': 1.0,
                  'batch_size': 32,
                  'beta1': 0.95,
                  'beta2': 0.99,
                  'decay': 0.9,
                  'epsilon': 1e-07,
                  'gradient_threshold': 5.0,
                  'initial_accumulator_value': 0.1,
                  'learning_rate': 0.0001,
                  'momentum': 0.0,
                  'num_epochs': 100000,
                  'num_steps': 700,
                  'optimizer': 'adam',
                  'recurrent_threshold': None,
                  'rescale_behavior': 'norm_rescaling'},
 'queueing': {'batch_queue_capacity': 10000,
              'bucket_boundaries': None,
              'file_queue_capacity': 1000,
              'min_after_dequeue': 500,
              'num_evaluation_invocations': 1,
              'shuffle': True},
 'regularization': {'alphabet_keep_probability': 1.0,
                    'alphabet_normalization': None,
                    'recurrent_input_keep_probability': [0.5, 0.5],
                    'recurrent_keep_probability': 1.0,
                    'recurrent_layer_normalization': False,
                    'recurrent_memory_zonein_probability': 1.0,
                    'recurrent_nonlinear_out_proj_normalization': None,
                    'recurrent_output_keep_probability': 1.0,
                    'recurrent_state_zonein_probability': 1.0,
                    'recurrent_variational_dropout': False}}



*** weighted testing evaluation configuration ***
{'architecture': {'all_to_all_peepholes': False,
                  'all_to_recurrent_skip_connections': False,
                  'alphabet_size': 60,
                  'alphabet_trainable': True,
                  'bidirectional': True,
                  'first_residual_connection_from_nth_layer': 1,
                  'higher_order_layers': True,
                  'include_dihedrals_between_layers': False,
                  'include_evolutionary': True,
                  'include_primary': True,
                  'include_recurrent_outputs_between_layers': True,
                  'input_to_recurrent_skip_connections': False,
                  'recurrent_layer_size': [800, 800],
                  'recurrent_nonlinear_out_proj_function': 'tanh',
                  'recurrent_nonlinear_out_proj_size': None,
                  'recurrent_peepholes': True,
                  'recurrent_to_output_skip_connections': False,
                  'recurrent_unit': 'CudnnLSTM',
                  'residual_connections_every_n_layers': None,
                  'tertiary_output': 'linear_alphabet'},
 'computing': {'allow_gpu_growth': False,
               'default_device': '',
               'fill_gpu': False,
               'functions_on_devices': {'/cpu:0': ['point_to_coordinate']},
               'gpu_fraction': 1.0,
               'num_cpus': 4,
               'num_reconstruction_fragments': 6,
               'num_reconstruction_parallel_iters': 4,
               'num_recurrent_parallel_iters': 1,
               'num_recurrent_shards': 1},
 'curriculum': {'base': 100.0,
                'behavior': None,
                'change_num_iterations': 5,
                'loss_history_subgroup': 'all',
                'mode': None,
                'rate': 0.002,
                'sharpness': 20.0,
                'slope': 1.0,
                'threshold': 5.0,
                'update_loss_history': False},
 'initialization': {'alphabet_init': {'dist': 'uniform', 'range': 3.14159},
                    'alphabet_seed': None,
                    'angle_shift': [0.0, 0.0, 0.0],
                    'dropout_seed': None,
                    'evolutionary_multiplier': 1.0,
                    'graph_seed': 426,
                    'queue_seed': None,
                    'recurrent_forget_bias': 1.0,
                    'recurrent_init': {'base': {'dist': 'uniform',
                                                'range': 0.01},
                                       'bias': {'dist': 'uniform',
                                                'range': 0}},
                    'recurrent_nonlinear_out_proj_init': {'base': {},
                                                          'bias': {}},
                    'recurrent_nonlinear_out_proj_seed': None,
                    'recurrent_out_proj_init': {'base': {'dist': 'uniform',
                                                         'range': 0.01},
                                                'bias': {'dist': 'uniform',
                                                         'range': 0}},
                    'recurrent_out_proj_seed': None,
                    'recurrent_seed': None,
                    'zoneout_seed': None},
 'io': {'alphabet_file': None,
        'checkpoint_every_n_hours': 24,
        'checkpoints_directory': None,
        'data_files': None,
        'data_files_glob': '../models/RGN12/data/ProteinNet12Thinning90/testing/*.tfrecord',
        'detailed_logs': True,
        'evaluation_sub_groups': ['10', '20', '30', '40', '50', '70', '90'],
        'log_alphabet': True,
        'log_model_summaries': True,
        'logs_directory': None,
        'max_checkpoints': None,
        'name': 'evaluation_wt_testing',
        'num_edge_residues': 0,
        'num_evo_entries': 42},
 'loss': {'atoms': 'c_alpha',
          'batch_dependent_normalization': True,
          'include': False,
          'tertiary_normalization': 'first',
          'tertiary_weight': 1.0},
 'optimization': {'alphabet_temperature': 1.0,
                  'batch_size': 1,
                  'beta1': 0.95,
                  'beta2': 0.99,
                  'decay': 0.9,
                  'epsilon': 1e-07,
                  'gradient_threshold': 5.0,
                  'initial_accumulator_value': 0.1,
                  'learning_rate': 0.0001,
                  'momentum': 0.0,
                  'num_epochs': 1,
                  'num_steps': 700,
                  'optimizer': 'adam',
                  'recurrent_threshold': None,
                  'rescale_behavior': 'norm_rescaling'},
 'queueing': {'batch_queue_capacity': 300,
              'bucket_boundaries': None,
              'file_queue_capacity': 10,
              'min_after_dequeue': 10,
              'num_evaluation_invocations': 1,
              'shuffle': False},
 'regularization': {'alphabet_keep_probability': 1.0,
                    'alphabet_normalization': None,
                    'recurrent_input_keep_probability': [0.5, 0.5],
                    'recurrent_keep_probability': 1.0,
                    'recurrent_layer_normalization': False,
                    'recurrent_memory_zonein_probability': 1.0,
                    'recurrent_nonlinear_out_proj_normalization': None,
                    'recurrent_output_keep_probability': 1.0,
                    'recurrent_state_zonein_probability': 1.0,
                    'recurrent_variational_dropout': False}}
2019-07-02 21:39:54.072394: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-07-02 21:39:54.085506: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2019-07-02 21:39:54.085564: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] retrieving CUDA diagnostic information for host: sequence-analysis
2019-07-02 21:39:54.085573: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:170] hostname: sequence-analysis
2019-07-02 21:39:54.085609: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:194] libcuda reported version is: 418.67.0
2019-07-02 21:39:54.085641: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:198] kernel reported version is: 418.67.0
2019-07-02 21:39:54.085648: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:305] kernel version seems to match DSO: 418.67.0
Traceback (most recent call last):
  File "model/protling.py", line 527, in <module>
    while loop(args): pass
  File "model/protling.py", line 379, in loop
    session = models['training'].start(models.values())
  File "/home/dariusz/structure/aqlaboratory/rgn/model/model.py", line 450, in _start
    self._saver.restore(session, latest_checkpoint)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1582, in restore
    err, "a mismatch between the current graph and the graph")
tensorflow.python.framework.errors_impl.InvalidArgumentError: Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams' with these attrs.  Registered devices: [CPU,XLA_CPU,XLA_GPU], Registered kernels:
  device='GPU'; T in [DT_DOUBLE]
  device='GPU'; T in [DT_FLOAT]
  device='GPU'; T in [DT_HALF]

	 [[node RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams (defined at /home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py:1251)  = CudnnRNNCanonicalToParams[T=DT_FLOAT, direction="unidirectional", dropout=0, input_mode="linear_input", num_params=8, rnn_mode="lstm", seed=426, seed2=4497](RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams/num_layers, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams/num_units, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams/input_size, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_1, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_2, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_3, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_4, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_5, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_6, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_7, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_8, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_9, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_10, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_11, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_12, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_13, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_14, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_15)]]

Caused by op u'RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams', defined at:
  File "model/protling.py", line 527, in <module>
    while loop(args): pass
  File "model/protling.py", line 301, in loop
    models.update({'training': RGNModel('training', configs['training'])})
  File "/home/dariusz/structure/aqlaboratory/rgn/model/model.py", line 114, in __init__
    self._create_graph(mode, self.config)
  File "/home/dariusz/structure/aqlaboratory/rgn/model/model.py", line 200, in _create_graph
    recurrent_outputs, recurrent_states = _higher_recurrence(mode, recurrence_config, inputs, num_stepss, alphabet=alphabet)
  File "/home/dariusz/structure/aqlaboratory/rgn/model/model.py", line 695, in _higher_recurrence
    layer_recurrent_outputs, layer_recurrent_states = _recurrence(mode, layer_config, layer_inputs, num_stepss)
  File "/home/dariusz/structure/aqlaboratory/rgn/model/model.py", line 789, in _recurrence
    outputs_directed, (_, states_directed) = rnn(inputs_directed, training=is_training)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/layers/base.py", line 374, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 746, in __call__
    self.build(input_shapes)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 352, in build
    opaque_params_t = self._canonical_to_opaque(weights, biases)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 474, in _canonical_to_opaque
    direction=self._direction)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 1251, in cudnn_rnn_canonical_to_opaque_params
    name=name)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_cudnn_rnn_ops.py", line 642, in cudnn_rnn_canonical_to_params
    name=name)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams' with these attrs.  Registered devices: [CPU,XLA_CPU,XLA_GPU], Registered kernels:
  device='GPU'; T in [DT_DOUBLE]
  device='GPU'; T in [DT_FLOAT]
  device='GPU'; T in [DT_HALF]

	 [[node RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams (defined at /home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py:1251)  = CudnnRNNCanonicalToParams[T=DT_FLOAT, direction="unidirectional", dropout=0, input_mode="linear_input", num_params=8, rnn_mode="lstm", seed=426, seed2=4497](RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams/num_layers, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams/num_units, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams/input_size, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_1, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_2, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_3, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_4, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_5, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_6, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_7, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_8, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_9, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_10, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_11, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_12, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_13, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_14, RGN/training/layer0/fw/cudnn_lstm/cudnn_lstm/random_uniform_15)]]

I greatly appreciate your time in helping to get this working on my end!

@ecvgit

ecvgit commented Jul 29, 2019

Hi @murakdar -- were you able to fix this issue?

@murakdar
Author

Hello @ecvgit. No, this issue remains unresolved.

@alquraishi
Contributor

Hi @murakdar, can you try specifying the GPU explicitly using -g0?
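For context on why device selection matters here: the original error lists "Registered devices: [CPU,XLA_CPU,XLA_GPU]", with no plain GPU entry, while the only kernels registered for CudnnRNNCanonicalToParams are for device='GPU'. A minimal sketch, using only the standard library and hypothetical helper names (they are not part of rgn or TensorFlow), for spotting this pattern in an error message:

```python
import re


def registered_devices(error_text):
    """Return the device types listed after 'Registered devices:' in an error message."""
    match = re.search(r"Registered devices:\s*\[([^\]]*)\]", error_text)
    if match is None:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


def has_plain_gpu(error_text):
    """True if a plain 'GPU' device (not only 'XLA_GPU') appears in the list."""
    return "GPU" in registered_devices(error_text)


# Error line copied from the first comment in this thread.
message = ("No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams' "
           "with these attrs.  Registered devices: [CPU,XLA_CPU,XLA_GPU], "
           "Registered kernels:")

print(registered_devices(message))  # ['CPU', 'XLA_CPU', 'XLA_GPU']
print(has_plain_gpu(message))       # False
```

If has_plain_gpu returns False, the process did not register a plain GPU device at all, so a GPU-only op such as this one cannot be placed.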

@ecvgit

ecvgit commented Jul 31, 2019

I was able to resolve this error. I think it happens when the cuDNN version is not compatible with the installed TensorFlow build. I was able to use TF 1.12 with cuDNN 7.9.0 and CUDA 9.

@murakdar
Author

murakdar commented Jul 31, 2019

Hello @alquraishi; adding -g0 helped, but now the problem is that I don't get any *.tertiary or *.recurrent_states output files, and the command ends with no feedback about why.

Here are the commands I tried and their output:

First, with python model/protling.py ../models/RGN12/runs/CASP12/ProteinNet12Thinning90/configuration-test -d ../models/RGN12 -p -e weighted_testing -g0, the log file shows:

<...warnings and configuration snipped; similar to first comment...>
2019-07-31 15:15:20.614840: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-07-31 15:15:21.465724: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-31 15:15:21.466352: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:04.0
totalMemory: 14.73GiB freeMemory: 14.52GiB
2019-07-31 15:15:21.466611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-07-31 15:15:35.049897: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-31 15:15:35.049960: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-07-31 15:15:35.049968: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-07-31 15:15:35.050107: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15079 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5)
2019-07-31 15:15:35.856331: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 14.73G (15812263936 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
WARNING:tensorflow:From /home/dariusz/structure/aqlaboratory/rgn/model/model.py:454: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.

To get rid of the resulting memory issue, I tried again with python model/protling.py ../models/RGN12/runs/CASP12/ProteinNet12Thinning90/configuration-test -d ../models/RGN12 -p -e weighted_testing -g0 --gpu_fraction 0.9, which produced the following log:

<...warnings and configuration snipped; similar to first comment...>
2019-07-31 21:18:25.896157: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-07-31 21:18:26.093152: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-31 21:18:26.093743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:04.0
totalMemory: 14.73GiB freeMemory: 14.52GiB
2019-07-31 21:18:26.093764: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-07-31 21:18:26.558373: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-31 21:18:26.558445: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-07-31 21:18:26.558455: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-07-31 21:18:26.558575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13571 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5)
WARNING:tensorflow:From /home/dariusz/structure/aqlaboratory/rgn/model/model.py:454: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version.
Instructions for updating:
To construct input pipelines, use the `tf.data` module.

It stops running after ~15 seconds. The directory ../models/RGN12/runs/CASP12/ProteinNet12Thinning90/11/outputsTesting/ gets created, but it is empty. Using a find command sorted by modification time, I confirmed that no output files are generated anywhere else. Other values of --gpu_fraction do not help.
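As a sanity check on the memory numbers in the two logs above, the arithmetic can be reproduced directly. This is only a sketch: it assumes that --gpu_fraction is applied to the total device memory, which the logs are consistent with but which is not confirmed here, and usable_mib is a hypothetical helper name.

```python
from __future__ import division, print_function

MIB = 2 ** 20
GIB = 2 ** 30

# Byte count from the "failed to allocate" line in the first log.
total_bytes = 15812263936


def usable_mib(total_bytes, fraction=1.0):
    """Whole MiB available when a fraction of total device memory is reserved."""
    return int(total_bytes * fraction) // MIB


print(round(total_bytes / GIB, 2))   # 14.73 (matches "totalMemory: 14.73GiB")
print(usable_mib(total_bytes))       # 15079 (device created without --gpu_fraction)
print(usable_mib(total_bytes, 0.9))  # 13571 (device created with --gpu_fraction 0.9)
```

In the first run the process tried to reserve all 14.73 GiB although only 14.52 GiB was free, which matches the CUDA_ERROR_OUT_OF_MEMORY message. With a fraction of 0.9 the request fits, so the memory warning disappears even though the missing output is unaffected.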

Any further ideas would be greatly appreciated.

@ecvgit: I am presently using cuDNN 7.1.4. In my first comment, I believe I was using cuDNN 7.6.1. I tried downgrading to fix the issue, but at some point got this error:

E tensorflow/stream_executor/cuda/cuda_dnn.cc:363] Loaded runtime CuDNN library: 7.0.5 but source was compiled with: 7.1.4. CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.

So I ultimately settled on version 7.1.4 to ensure compatibility. Edited to add: no difference when using cuDNN 7.6.1.
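The compatibility rule quoted in that error message can be written out as a small check. This is a sketch of the rule as the message states it, not TensorFlow's own implementation; the function names are hypothetical.

```python
def parse_version(text):
    """Split a 'major.minor.patch' string into a tuple of three ints."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def cudnn_runtime_ok(loaded, compiled):
    """Apply the rule from the error message.

    The major version must match. For cuDNN 7.0 or later, the loaded minor
    version may be equal to or higher than the compiled one; before 7.0 the
    minor version must match exactly. The patch version is ignored.
    """
    loaded_major, loaded_minor, _ = parse_version(loaded)
    compiled_major, compiled_minor, _ = parse_version(compiled)
    if loaded_major != compiled_major:
        return False
    if compiled_major >= 7:
        return loaded_minor >= compiled_minor
    return loaded_minor == compiled_minor


print(cudnn_runtime_ok("7.0.5", "7.1.4"))  # False, the case reported in the error
print(cudnn_runtime_ok("7.1.4", "7.1.4"))  # True
print(cudnn_runtime_ok("7.6.1", "7.1.4"))  # True
```

Under this rule both 7.1.4 and 7.6.1 satisfy a build compiled against 7.1.4, which fits the observation that switching between them makes no difference.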

@ecvgit

ecvgit commented Aug 1, 2019

Could you try running it for CASP7?

@ecvgit

ecvgit commented Aug 1, 2019

@alquraishi Is it possible to share the .tertiary files for the models reported in the paper? I was able to generate the .tertiary files, but the DRMSD does not match -- which makes it hard to figure out if there is something wrong in my DRMSD computation vs using the wrong .tertiary files.

@murakdar
Author

murakdar commented Aug 1, 2019

> Could you try running it for CASP7?

Tried, still the same behavior. @ecvgit, if I understand correctly, you have been able to run new predictions with the pre-trained model; could you perhaps share an example FASTA sequence file, the corresponding .tfrecord file, and a configuration file that I could drop into one of the pre-trained models?

I did some further debugging and found that I'm hitting tf.errors.OutOfRangeError in the main loop. It's being thrown from RGNModel.predict at rgn/model/model.py, lines 320 to 321 in 0133213:

# evaluate prediction dict
prediction_dict = ops_to_dict(session, self._prediction_ops)

This ultimately calls tf.Session.run() on the TF ops. The TF ops being run (i.e. self._prediction_ops) look like this:

{'num_stepss': <tf.Tensor 'RGN/evaluation_wt_testing/num_stepss:0' shape=(1,) dtype=int32>,
 'ids': <tf.Tensor 'RGN/evaluation_wt_testing/ids:0' shape=(1,) dtype=string>,
 'coordinates': <tf.Tensor 'RGN/evaluation_wt_testing/point_to_coordinate:0' shape=(?, 1, 3) dtype=float32>,
 'recurrent_states': <tf.Tensor 'RGN/evaluation_wt_testing/concat:0' shape=(?, 3200) dtype=float32>}

For what it's worth, here's the complete traceback for running an individual op:

(Pdb) session.run(ops['num_stepss'])
*** OutOfRangeError: PaddingFIFOQueue '_3_RGN/evaluation_wt_testing/batching_queue/padding_fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
	 [[node RGN/evaluation_wt_testing/batching_queue (defined at /home/dariusz/structure/aqlaboratory/rgn/model/model.py:549)  = QueueDequeueManyV2[component_types=[DT_STRING, DT_FLOAT, DT_FLOAT, DT_INT32, DT_FLOAT, DT_FLOAT, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](RGN/evaluation_wt_testing/batching_queue/padding_fifo_queue, RGN/evaluation_wt_testing/batching_queue/n)]]
	 [[{{node RGN/evaluation_wt_testing/batching_queue/_169}} = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_4_RGN/evaluation_wt_testing/batching_queue", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Caused by op u'RGN/evaluation_wt_testing/batching_queue', defined at:
  File "model/protling.py", line 529, in <module>
    while loop(args): pass
  File "model/protling.py", line 337, in loop
    models.update({'eval_wt_test': RGNModel('evaluation', configs['eval_wt_test'])})
  File "/home/dariusz/structure/aqlaboratory/rgn/model/model.py", line 114, in __init__
    self._create_graph(mode, self.config)
  File "/home/dariusz/structure/aqlaboratory/rgn/model/model.py", line 179, in _create_graph
    ids, primaries, evolutionaries, secondaries, tertiaries, masks, num_stepss = _dataflow(dataflow_config, max_length)
  File "/home/dariusz/structure/aqlaboratory/rgn/model/model.py", line 549, in _dataflow
    inputs = read_protein(file_queue, max_length, config['num_edge_residues'], config['num_evo_entries'])
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 306, in new_func
    return func(*args, **kwargs)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/training/input.py", line 1074, in maybe_batch
    name=name)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/training/input.py", line 787, in _batch
    dequeued = queue.dequeue_many(batch_size, name=name)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/ops/data_flow_ops.py", line 478, in dequeue_many
    self._queue_ref, n=n, component_types=self._dtypes, name=name)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 3487, in queue_dequeue_many_v2
    component_types=component_types, timeout_ms=timeout_ms, name=name)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/home/dariusz/.virtualenvs/py2-tfgpu/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

OutOfRangeError (see above for traceback): PaddingFIFOQueue '_3_RGN/evaluation_wt_testing/batching_queue/padding_fifo_queue' is closed and has insufficient elements (requested 1, current size 0)
	 [[node RGN/evaluation_wt_testing/batching_queue (defined at /home/dariusz/structure/aqlaboratory/rgn/model/model.py:549)  = QueueDequeueManyV2[component_types=[DT_STRING, DT_FLOAT, DT_FLOAT, DT_INT32, DT_FLOAT, DT_FLOAT, DT_INT32], timeout_ms=-1, _device="/job:localhost/replica:0/task:0/device:CPU:0"](RGN/evaluation_wt_testing/batching_queue/padding_fifo_queue, RGN/evaluation_wt_testing/batching_queue/n)]]
	 [[{{node RGN/evaluation_wt_testing/batching_queue/_169}} = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_4_RGN/evaluation_wt_testing/batching_queue", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

@ecvgit

ecvgit commented Aug 1, 2019

I was able to run the predictions on the ProteinNet test set.
I didn't make any changes to the config file; I just extracted RGN7.tar.gz and used the following command:
python protling.py RGN7/runs/CASP7/ProteinNet7Thinning90/configuration -d RGN7 -p -e weighted_testing -g 0

@murakdar
Author

murakdar commented Aug 7, 2019

I am now able to run predictions using the default configuration file as indicated -- thank you, @ecvgit and @alquraishi.

However, I am still unable to run predictions of a single new sequence.

The queue/range error in my last comment suggests my problem relates to the .tfrecord file output from the convert_to_tfrecord.py script.
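The error text ("is closed and has insufficient elements (requested 1, current size 0)") means the input queue delivered no examples at all, so a first check is whether the .tfrecord file contains any records. A minimal sketch that counts records without TensorFlow, relying on the documented TFRecord framing (8-byte little-endian length, 4-byte CRC of the length, payload, 4-byte CRC of the payload); count_tfrecords is a hypothetical helper and the path in the usage comment is a placeholder:

```python
import struct


def count_tfrecords(path):
    """Count records in a TFRecord file by walking its length-prefixed framing.

    The CRC fields are skipped, not verified, so this only detects empty or
    truncated files, not corrupted payloads.
    """
    count = 0
    with open(path, "rb") as handle:
        while True:
            header = handle.read(12)  # 8-byte length + 4-byte CRC of the length
            if not header:
                return count  # clean end of file
            if len(header) < 12:
                raise ValueError("truncated record header")
            (length,) = struct.unpack("<Q", header[:8])
            body = handle.read(length + 4)  # payload + 4-byte CRC of the payload
            if len(body) < length + 4:
                raise ValueError("truncated record body")
            count += 1


# Usage (placeholder path):
# print(count_tfrecords("my_sequence.tfrecord"))
```

A count of zero would explain the empty queue directly. A non-zero count does not show that the records match the fields read_protein expects, so a schema mismatch would still need a separate check.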

Shall I continue here, or open a separate issue for that? (I'm inclined toward the latter, since the -g0 option does let me load and run the model on the GPU.)
