
Failing in tensorflow_cc.so on Windows 7 on a Quadro R5000 16GB with v1.12, CUDA 10.0.130 and cuDNN 7.4.2.24; OK under Windows 10 on a Quadro P5000 and GTX 1060 6GB #27441

Closed
samhodge opened this issue Apr 3, 2019 · 31 comments
Assignees
Labels
comp:runtime c++ runtime, performance issues (cpu) stat:awaiting tensorflower Status - Awaiting response from tensorflower subtype:windows Windows Build/Installation Issues type:bug Bug

Comments

samhodge commented Apr 3, 2019

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    I have linked against the //tensorflow:libtensorflow_cc.so and //tensorflow:libtensorflow_framework.so targets, along with other libraries (abseil-cpp, libprotobuf, etc.)
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    Windows 10 (build) and Windows 7 (deployment)
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): v1.12.0
    (tensorflow-cuda10) C:\Users\user\dev\tensorflow-cuda10\tensorflow\tensorflow\core\common_runtime\gpu>python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
    b'v1.12.0-0-ga6d8ffae09' 1.12.0
  • Python version: 3.6 (N/A)
  • Bazel version (if compiling from source): 0.19.2
  • GCC/Compiler version (if compiling from source): MSVC 14.0
  • CUDA/cuDNN version: 10.0.130, 7.4.2.24
  • GPU model and memory: GTX 1060 6Gb and Quadro R5000 16Gb

You can collect some of this information using our environment capture script
You can also obtain the TensorFlow version with
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

(tensorflow-cuda10) C:\Users\user\dev\tensorflow-cuda10\tensorflow\tensorflow\core\common_runtime\gpu>python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
b'v1.12.0-0-ga6d8ffae09' 1.12.0

Describe the current behavior
The application currently crashes when initialising the session on the Quadro card on the client's computer running Windows 7, with the error message:

2019-04-02 11:30:18.871580: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:274] Unexpected Event status: 1

Here is the code for that file, where the LOG(FATAL) at line 274 is what fires:

// This function must be called periodically to check whether pending
// events have recorded, and then retire them.  Initial observations
// suggest that typical behavior in a TensorFlow program is to have
// 0-3 events pending most of the time, but there are occasionally
// spikes of up to several hundred outstanding.
//
// NOTE: If all events are on the same stream, no later event will
// complete before an earlier event, except possibly if the earlier
// event transitions to an error state, so there's no advantage in
// looking past the first kPending event.  However, if we're using
// multiple streams there may be some gain in looking deeper.
// As a compromise, PollEvent() calls that are triggered by the queueing
// of a single event never look past the first kPending event.  Calls
// coming from the dedicated polling thread always sweep the full queue.
//
// Note that allowing the queue to grow very long could cause overall
// GPU memory use to spike needlessly.  An alternative strategy would
// be to throttle new Op execution until the pending event queue
// clears.
void EventMgr::PollEvents(bool is_dedicated_poller,
                          gtl::InlinedVector<InUse, 4>* to_free) {
  VLOG(2) << "PollEvents  free_events_ " << free_events_.size()
          << " used_events_ " << used_events_.size();
  // Sweep the remaining events in order.  If this is the dedicated
  // polling thread, check the entire set.  Otherwise, just sweep up to
  // the first non-complete record that is still pending.
  for (auto& iu : used_events_) {
    if (iu.event == nullptr) continue;
    se::Event::Status s = iu.event->PollForStatus();
    switch (s) {
      case se::Event::Status::kUnknown:
      case se::Event::Status::kError:
        // We don't expect to see these.  Someday maybe propagate
        // a Status error, but for now fail hard.
        LOG(FATAL) << "Unexpected Event status: " << static_cast<int>(s);
        break;
      case se::Event::Status::kPending:
        if (!is_dedicated_poller) return;  // quit processing queue
        break;
      case se::Event::Status::kComplete:
        // Make a copy of the InUse record so we can free it after releasing
        // the lock
        to_free->push_back(iu);
        free_events_.push_back(iu.event);
        // Mark this InUse record as completed.
        iu.event = nullptr;
    }
  }
  // Then clear any completed InUse records from the front of the queue.
  while (!used_events_.empty()) {
    InUse& iu = used_events_.front();
    if (iu.event == nullptr) {
      used_events_.pop_front();
    } else {
      break;
    }
  }
}

}  // namespace tensorflow
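The sweep semantics described in the comment block can be illustrated with a toy model (this is a simplified sketch for illustration, not the real EventMgr; the enum stands in for se::Event::Status):

```cpp
#include <deque>

// Toy model of the sweep: a non-dedicated poller stops at the first
// kPending event, while the dedicated polling thread sweeps the entire
// queue. Returns the number of completed events retired.
enum class EventStatus { kPending, kComplete };

int SweepEvents(const std::deque<EventStatus>& used_events,
                bool is_dedicated_poller) {
  int retired = 0;
  for (const auto& s : used_events) {
    if (s == EventStatus::kPending) {
      if (!is_dedicated_poller) break;  // quit processing the queue
      continue;  // dedicated poller keeps looking past kPending
    }
    ++retired;  // kComplete: would be moved to free_events_ here
  }
  return retired;
}
```

With a queue of {kComplete, kPending, kComplete}, a non-dedicated poller retires only the first event, while the dedicated poller retires both completed events.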

Describe the expected behavior
I would expect the software to load the graph into a fresh session and compute the result.

Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.
tensorflow::SessionOptions options;
tensorflow::ConfigProto* config = &options.config;
config->mutable_gpu_options()->set_per_process_gpu_memory_fraction(0.9);
auto* device_count = config->mutable_device_count();
device_count->insert({"GPU", 1});
device_count->insert({"CPU", 1});
// bytes is read from graph_file_name
graph_def->ParseFromArray(bytes.data(), static_cast<int>(bytes.size()));
session->reset(tensorflow::NewSession(options));
std::cout << "Rotobot: Swapping to model: " << graph_file_name << " using a single model per render is more efficent" << std::endl;
// crashes after here
auto status = (*session)->Create(*graph_def);
auto status2 = (*session)->Run(Input_Tensors);
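The elided "bytes is read from graph_file_name" step could be sketched like this (ReadAllBytes is a hypothetical helper, not part of the plugin above):

```cpp
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper for the elided step: slurp the serialized GraphDef
// at `path` into the byte buffer that graph_def->ParseFromArray() consumes.
std::vector<char> ReadAllBytes(const std::string& path) {
  std::ifstream in(path, std::ios::binary);
  return std::vector<char>(std::istreambuf_iterator<char>(in),
                           std::istreambuf_iterator<char>());
}
```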

Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

You can download the built software from:
https://kognat.com/product/rotobot-openfx-plugin-windows-64-gpu-v1-2-0-rc2-cuda-10/

You will just need an OpenFX host like Natron
https://natrongithub.github.io/

This tutorial will give you reproduction steps
https://kognat.com/2019/03/28/rotobot-srgb/

samhodge commented Apr 3, 2019

OK, walking backwards through the call stack, I can see this could have been raised from:

https://github.com/tensorflow/tensorflow/blob/v1.12.0/tensorflow/core/common_runtime/gpu/gpu_device.cc#L543

https://github.com/tensorflow/tensorflow/blob/v1.12.0/tensorflow/core/common_runtime/gpu/gpu_util.cc#L166

https://github.com/tensorflow/tensorflow/blob/v1.12.0/tensorflow/core/common_runtime/gpu/gpu_util.cc#L238

https://github.com/tensorflow/tensorflow/blob/v1.12.0/tensorflow/core/common_runtime/gpu/gpu_util.cc#L288

https://github.com/tensorflow/tensorflow/blob/v1.12.0/tensorflow/core/common_runtime/gpu/gpu_util.cc#L334

Actually, a grep is easier:

tensorflow/core/common_runtime/gpu/gpu_event_mgr.h:42:// The callback provided to EventMgr::ThenExecute must not block or take a long
tensorflow/core/common_runtime/gpu/gpu_event_mgr.h:99:  inline void ThenExecute(se::Stream* stream, std::function<void()> func) {
tensorflow/core/common_runtime/gpu/gpu_event_mgr_test.cc:259:  em.ThenExecute(stream.get(), [&hit, &note]() {
tensorflow/core/common_runtime/gpu/gpu_util.cc:166:  dev_info->event_mgr->ThenExecute(
tensorflow/core/common_runtime/gpu/gpu_util.cc:238:  dev_info->event_mgr->ThenExecute(
tensorflow/core/common_runtime/gpu/gpu_util.cc:288:  dev_info->event_mgr->ThenExecute(
tensorflow/core/common_runtime/gpu/gpu_util.cc:334:  dev_info->event_mgr->ThenExecute(
tensorflow/core/common_runtime/gpu/gpu_util_platform_specific.cc:40:Status GPUDeviceContext::ThenExecute(Device* device, se::Stream* stream,
tensorflow/core/common_runtime/gpu/gpu_util_platform_specific.cc:44:  gpu_info->event_mgr->ThenExecute(stream, func);
tensorflow/core/common_runtime/gpu_device_context.h:63:  Status ThenExecute(Device* device, se::Stream* stream,
tensorflow/core/common_runtime/ring_reducer.cc:580:    Status s = gpu_info->default_context->ThenExecute(
tensorflow/core/common_runtime/ring_reducer.cc:587:          errors::Internal("Failed to dispatch ThenExecute in RingReducer");
tensorflow/core/framework/device_base.h:98:  virtual Status ThenExecute(Device* device, stream_executor::Stream* stream,
tensorflow/core/framework/device_base.h:100:    return errors::Internal("ThenExecute not supported by device");
tensorflow/core/kernels/check_numerics_op.cc:208:    context->device()->tensorflow_gpu_device_info()->event_mgr->ThenExecute(
tensorflow/core/kernels/crop_and_resize_op.cc:823:  context->device()->tensorflow_gpu_device_info()->event_mgr->ThenExecute(
tensorflow/core/kernels/cuda_device_array.h:89:    context_->device()->tensorflow_gpu_device_info()->event_mgr->ThenExecute(
tensorflow/core/kernels/cuda_solvers.cc:247:      ->event_mgr->ThenExecute(stream, std::move(cb));
tensorflow/core/kernels/dynamic_partition_op_gpu.cu.cc:318:    c->device()->tensorflow_gpu_device_info()->event_mgr->ThenExecute(
tensorflow/core/kernels/segment_reduction_ops.cc:292:    context->device()->tensorflow_gpu_device_info()->event_mgr->ThenExecute(
tensorflow/core/kernels/where_op.cc:358:    context->device()->tensorflow_gpu_device_info()->event_mgr->ThenExecute(

I am surprised I am not getting any error messages out of

https://github.com/tensorflow/tensorflow/blob/v1.12.0/tensorflow/stream_executor/cuda/cuda_driver.cc

The verbosity of the application is turned down to error only; maybe I can supply the client with a build with maximum verbosity and see if that helps trace the error.

samhodge commented Apr 3, 2019

OK my action plan for now is to build a version with maximum debug and see what is happening and what is not happening.

Thanks for listening.

sam

samhodge commented Apr 5, 2019

I got a new error report, with error level 3 only:

tfSession->Run failed: Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node xception_65/entry_flow/conv1_1/Conv2D}} = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 2, 2], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](xception_65/entry_flow/conv1_1/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer, xception_65/entry_flow/conv1_1/weights)]]
[[{{node SemanticPredictions/_45}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2428_SemanticPredictions", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

samhodge commented Apr 5, 2019

Everything points to a runtime environment problem #24828 (comment)

Is it possible to write a diagnostic tool which prints the path and versions of cuDNN and CUDA in the client's runtime environment?

Sam
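For the version half of such a tool: the CUDA runtime and cuDNN both report packed version integers (cudaRuntimeGetVersion()/cudaDriverGetVersion() give major*1000 + minor*10; cudnnGetVersion() gives major*1000 + minor*100 + patch). A sketch of the decoding, with the surrounding calls into the CUDA/cuDNN APIs left out:

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Decode the packed integer from cudaRuntimeGetVersion() or
// cudaDriverGetVersion(): major * 1000 + minor * 10 (e.g. 10000 -> "10.0").
std::string FormatCudaVersion(int v) {
  char buf[16];
  std::snprintf(buf, sizeof(buf), "%d.%d", v / 1000, (v % 1000) / 10);
  return buf;
}

// Decode the value returned by cudnnGetVersion():
// major * 1000 + minor * 100 + patch (e.g. 7402 -> "7.4.2").
std::string FormatCudnnVersion(std::size_t v) {
  char buf[24];
  std::snprintf(buf, sizeof(buf), "%zu.%zu.%zu", v / 1000, (v % 1000) / 100,
                v % 100);
  return buf;
}
```

The tool would call cudaRuntimeGetVersion(&v) and cudnnGetVersion() and print the decoded strings; finding the *path* of the loaded cudnn64_7.dll would additionally need something like GetModuleFileName on Windows.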

samhodge commented Apr 6, 2019

It all leads back to cudnn64_7.dll not having an overly specific name; see https://en.wikipedia.org/wiki/DLL_Hell

samhodge commented Apr 6, 2019

Could be related to this: #24496

samhodge commented Apr 6, 2019

The suggested solution on that ticket is to allow growth

https://www.tensorflow.org/guide/using_gpu#allowing_gpu_memory_growth

But this is a performance decision

samhodge commented Apr 7, 2019

Just confirmed this code:

options.config.mutable_gpu_options()->set_allow_growth(true);
options.config.mutable_gpu_options()->set_per_process_gpu_memory_fraction(fraction);

This results in a properly calculated graph

// options.config.mutable_gpu_options()->set_allow_growth(true);
options.config.mutable_gpu_options()->set_per_process_gpu_memory_fraction(fraction);

samhodge commented Apr 7, 2019

Here is the log without allow growth set:

Rotobot: Model Decrypting Started... Decrypting Ended!
Rotobot: Calculating with the follow CUDA enabled GPU
Rotobot: Device Number: 0
Rotobot:   Device name: GeForce GTX 1060 6GB
Rotobot:   Using VRAM percentage 80.4%
2019-04-07 09:56:42.668345: W tensorflow/stream_executor/cuda/cuda_driver.cc:416] A non-primary context 0000022A3A10C490 for device 0 exists before initializing the StreamExecutor. The primary context is now 0000022A765700B0. We haven't verified StreamExecutor works with that.
2019-04-07 09:56:42.699831: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.7085
pciBusID: 0000:0f:00.0
totalMemory: 6.00GiB freeMemory: 4.77GiB
2019-04-07 09:56:42.725273: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-04-07 09:56:44.840990: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-07 09:56:44.860442: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-04-07 09:56:44.870580: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-04-07 09:56:44.879572: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4938 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:0f:00.0, compute capability: 6.1)
Rotobot: Swapping to model: C:\Program Files (x86)\Kognat/rotobot_segmentation.pb using a single model per render is more efficent
2019-04-07 09:57:05.076919: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 4.82G (5178684160 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
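The numbers above are consistent with the fraction being applied to total memory: 80.4% of the 6 GiB total is roughly the 4.82G allocation that failed, while only 4.77G was actually free. A sketch of deriving the fraction from free memory instead (cudaMemGetInfo() supplies free/total; SafeMemoryFraction is a hypothetical helper, not TensorFlow API):

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: derive per_process_gpu_memory_fraction from the
// free VRAM reported by cudaMemGetInfo() rather than from total VRAM,
// since TensorFlow applies the fraction to *total* device memory.
double SafeMemoryFraction(std::uint64_t free_bytes, std::uint64_t total_bytes,
                          double headroom = 0.95) {
  double fraction = headroom * static_cast<double>(free_bytes) /
                    static_cast<double>(total_bytes);
  return std::min(fraction, headroom);
}
```

For the log above (4.77 GiB free of 6 GiB total) this would yield roughly 0.76 instead of the 0.804 that overshot the free memory.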

samhodge commented Apr 7, 2019

With allow growth enabled, it uses shared GPU memory as well as dedicated GPU memory.

Rotobot: Model Decrypting Started... Decrypting Ended!
Rotobot: Calculating with the follow CUDA enabled GPU
Rotobot: Device Number: 0
Rotobot:   Device name: GeForce GTX 1060 6GB
Rotobot:   Using VRAM percentage 80.6%
2019-04-07 10:20:07.363840: W tensorflow/stream_executor/cuda/cuda_driver.cc:416] A non-primary context 00000218A6A37420 for device 0 exists before initializing the StreamExecutor. The primary context is now 00000218E5E1DD60. We haven't verified StreamExecutor works with that.
2019-04-07 10:20:07.391016: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.7085
pciBusID: 0000:0f:00.0
totalMemory: 6.00GiB freeMemory: 4.74GiB
2019-04-07 10:20:07.413503: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-04-07 10:20:08.383296: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-07 10:20:08.399261: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-04-07 10:20:08.408422: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-04-07 10:20:08.417466: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4954 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:0f:00.0, compute capability: 6.1)
Rotobot: Swapping to model: C:\Program Files (x86)\Kognat/rotobot_segmentation.pb using a single model per render is more efficent
2019-04-07 10:20:28.264662: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 4.84G (5195658240 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-04-07 10:20:39.033128: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2019-04-07 10:20:39.055123: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
Rotobot: tfSession->Run failed: Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
         [[{{node xception_65/entry_flow/conv1_1/Conv2D}} = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 2, 2], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](xception_65/entry_flow/conv1_1/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer, xception_65/entry_flow/conv1_1/weights)]]
         [[{{node SemanticPredictions/_45}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2428_SemanticPredictions", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

samhodge commented Apr 7, 2019

Looks like this tool

http://docs.nvidia.com/cuda/cuda-memcheck/index.html

can be useful, if I can instruct the clients on how to use it.

samhodge commented Apr 7, 2019

With allow growth turned on, there is nothing insightful from cuda-memcheck.

see

C:\WINDOWS\system32>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin\cuda-memcheck.exe" --report-api-errors all "C:\Program Files\Nuke11.2v5\Nuke11.2.exe"
========= CUDA-MEMCHECK
Nuke 11.2v5, 64 bit, built Nov 20 2018.
Copyright (c) 2018 The Foundry Visionmongers Ltd.  All Rights Reserved.
Licence expires on: 2019/5/23
A QuickTime install could not be detected. Reading and writing of QuickTime files will be limited.
Disk cache C:/Users/user/AppData/Local/Temp/nuke\ViewerCache/??: 8036MB (79% of 10240MB) used in 1412 files.
Rotobot: Model Decrypting Started... Decrypting Ended!
Rotobot: Calculating with the follow CUDA enabled GPU
Rotobot: Device Number: 0
Rotobot:   Device name: GeForce GTX 1060 6GB
Rotobot: Swapping to model: C:\Program Files (x86)\Kognat/rotobot_segmentation.pb using a single model per render is more efficent
Rotobot: tfSession->Run failed: Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
         [[{{node xception_65/entry_flow/conv1_1/Conv2D}} = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 2, 2], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](xception_65/entry_flow/conv1_1/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer, xception_65/entry_flow/conv1_1/weights)]]
         [[{{node SemanticPredictions/_45}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2428_SemanticPredictions", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
========= Error: process didn't terminate successfully
========= No CUDA-MEMCHECK results found

samhodge commented Apr 8, 2019

Without memory growth on, I am finally able to get a trace from Natron:

C:\Users\user>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin\cuda-memcheck.exe" "C:\Program Files\INRIA\Natron-2.3.14\bin\Natron.exe"
========= CUDA-MEMCHECK
Natron Version 2.3.14
Copyright (C) 2013-2018 INRIA and Alexandre Gauthier-Foichat
>>>Use the --help or -h option to print usage.<<<
Info: init.py script not loaded (this is not an error)
Info: initGui.py script not loaded (this is not an error)
Rotobot: Model Decrypting Started... Decrypting Ended!
Rotobot: Calculating with the follow CUDA enabled GPU
Rotobot: Device Number: 0
Rotobot:   Device name: GeForce GTX 1060 6GB
Rotobot:   Using VRAM percentage 83.1%
2019-04-08 09:33:09.663952: W tensorflow/stream_executor/cuda/cuda_driver.cc:416] A non-primary context 0000000044ED9EA0 for device 0 exists before initializing the StreamExecutor. The primary context is now 0000000049662B00. We haven't verified StreamExecutor works with that.
2019-04-08 09:33:09.691736: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.7085
pciBusID: 0000:0f:00.0
totalMemory: 6.00GiB freeMemory: 4.97GiB
2019-04-08 09:33:09.727184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-04-08 09:33:15.074830: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-08 09:33:15.089958: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-04-08 09:33:15.104339: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-04-08 09:33:15.118188: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5107 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:0f:00.0, compute capability: 6.1)
Rotobot: Swapping to model: C:\Program Files (x86)\Kognat/rotobot_segmentation.pb using a single model per render is more efficent
2019-04-08 09:33:33.417588: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 4.99G (5355410176 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-04-08 09:34:28.049345: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-04-08 09:34:28.067807: E tensorflow/stream_executor/cuda/cuda_dnn.cc:373] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Rotobot: tfSession->Run failed: Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
         [[{{node xception_65/entry_flow/conv1_1/Conv2D}} = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 2, 2], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](xception_65/entry_flow/conv1_1/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer, xception_65/entry_flow/conv1_1/weights)]]
         [[{{node SemanticPredictions/_45}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2428_SemanticPredictions", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
========= Error: process didn't terminate successfully
========= Program hit CUDA_ERROR_OUT_OF_MEMORY (error 2) due to "out of memory" on CUDA API call to cuMemAlloc_v2.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\WINDOWS\SYSTEM32\nvcuda.dll (cuD3D9UnmapVertexBuffer + 0x1b1147) [0x1bf482]
=========     Host Frame:C:\Program Files (x86)\Kognat\shared_libraries\tensorflow_cc.dll (icu_62::MessagePattern::getPatternString + 0x45b6) [0x32eedd6]
=========     Host Frame:C:\Program Files (x86)\Kognat\shared_libraries\tensorflow_cc.dll (icu_62::MessagePattern::getPatternString + 0x63780) [0x334dfa0]
=========     Host Frame:C:\Program Files (x86)\Kognat\shared_libraries\tensorflow_cc.dll (icu_62::UCharsTrieBuilder::matchNodesCanHaveValues + 0x2e6de3) [0x31f68d3]
=========     Host Frame:C:\Program Files (x86)\Kognat\shared_libraries\tensorflow_cc.dll (icu_62::ResourceTable::getSize + 0x382dc) [0x32aa36c]
=========     Host Frame:C:\Program Files (x86)\Kognat\shared_libraries\tensorflow_cc.dll (icu_62::ResourceTable::getSize + 0x36bad) [0x32a8c3d]
=========     Host Frame:C:\Program Files (x86)\Kognat\shared_libraries\tensorflow_cc.dll (icu_62::ResourceTable::getSize + 0x36880) [0x32a8910]
=========     Host Frame:C:\Program Files (x86)\Kognat\shared_libraries\tensorflow_cc.dll (icu_62::ResourceTable::getSize + 0x36a4a) [0x32a8ada]
=========     Host Frame:C:\Program Files (x86)\Kognat\shared_libraries\tensorflow_cc.dll (tensorflow::TensorShapeBase<tensorflow::TensorShape>::dim_size + 0xb0db) [0x33fd62b]
=========     Host Frame:C:\Program Files (x86)\Kognat\shared_libraries\tensorflow_cc.dll (tensorflow::TensorShapeBase<tensorflow::TensorShape>::dim_size + 0xf942) [0x3401e92]
=========     Host Frame:C:\Program Files (x86)\Kognat\shared_libraries\tensorflow_cc.dll (icu_62::UCharsTrieBuilder::matchNodesCanHaveValues + 0x2f54be) [0x3204fae]
=========     Host Frame:C:\Program Files (x86)\Kognat\shared_libraries\tensorflow_cc.dll (icu_62::UCharsTrieBuilder::matchNodesCanHaveValues + 0x2f5215) [0x3204d05]
=========     Host Frame:C:\Program Files (x86)\Kognat\shared_libraries\tensorflow_cc.dll (icu_62::CharString::length + 0xe9f4b) [0x650b4b]
=========     Host Frame:C:\Program Files (x86)\Kognat\shared_libraries\tensorflow_cc.dll (icu_62::CharString::length + 0xe8361) [0x64ef61]
=========     Host Frame:C:\Program Files (x86)\Kognat\shared_libraries\tensorflow_cc.dll (icu_62::ReorderingBuffer::getStart + 0x1438a) [0x341e2ba]
=========     Host Frame:C:\Program Files (x86)\Kognat\shared_libraries\tensorflow_cc.dll (icu_62::ResourceTable::getSize + 0x612b) [0x32781bb]
=========     Host Frame:C:\Program Files (x86)\Kognat\shared_libraries\tensorflow_cc.dll (tensorflow::NewSession + 0x33c17) [0x3269827]
=========     Host Frame:C:\Program Files (x86)\Kognat\shared_libraries\tensorflow_cc.dll (tensorflow::NewSession + 0x342df) [0x3269eef]
=========     Host Frame:C:\Program Files (x86)\Kognat\shared_libraries\tensorflow_cc.dll (icu_62::UCharsTrieBuilder::matchNodesCanHaveValues + 0x1ca132) [0x30d9c22]
=========     Host Frame:C:\Program Files (x86)\Kognat\shared_libraries\tensorflow_cc.dll (icu_62::UCharsTrieBuilder::matchNodesCanHaveValues + 0x1c5162) [0x30d4c52]
=========     Host Frame:C:\Program Files (x86)\Kognat\shared_libraries\tensorflow_cc.dll (icu_62::UCharsTrieBuilder::matchNodesCanHaveValues + 0x1c7786) [0x30d7276]
=========     Host Frame:C:\Program Files (x86)\Kognat\shared_libraries\tensorflow_cc.dll (icu_62::UCharsTrieBuilder::matchNodesCanHaveValues + 0x1c6a34) [0x30d6524]
=========     Host Frame:C:\Program Files (x86)\Kognat\shared_libraries\tensorflow_cc.dll (icu_62::UCharsTrieBuilder::matchNodesCanHaveValues + 0x1d5821) [0x30e5311]
=========     Host Frame:C:\Program Files (x86)\Kognat\shared_libraries\tensorflow_cc.dll (icu_62::ICUServiceKey::getID + 0x7c550) [0x134d040]
=========     Host Frame:C:\Program Files (x86)\Kognat\shared_libraries\tensorflow_cc.dll (icu_62::ResourceTable::getSize + 0xa069) [0x327c0f9]
=========     Host Frame:C:\Program Files (x86)\Kognat\shared_libraries\tensorflow_cc.dll (icu_62::ResourceTable::getSize + 0xbba4) [0x327dc34]
=========     Host Frame:C:\Program Files (x86)\Kognat\shared_libraries\tensorflow_cc.dll (icu_62::ResourceTable::getSize + 0xba41) [0x327dad1]
=========     Host Frame:C:\Program Files (x86)\Kognat\shared_libraries\tensorflow_cc.dll (icu_62::ResourceTable::getSize + 0x900) [0x3272990]
=========     Host Frame:C:\Program Files (x86)\Kognat\shared_libraries\tensorflow_cc.dll (icu_62::UCharsTrieBuilder::matchNodesCanHaveValues + 0x1cc884) [0x30dc374]
=========     Host Frame:C:\Program Files (x86)\Kognat\shared_libraries\tensorflow_cc.dll (icu_62::UCharsTrieBuilder::matchNodesCanHaveValues + 0x1cf508) [0x30deff8]
=========     Host Frame:C:\Program Files (x86)\Kognat\shared_libraries\tensorflow_cc.dll (icu_62::UCharsTrieBuilder::matchNodesCanHaveValues + 0x1d2603) [0x30e20f3]
=========     Host Frame:C:\Program Files (x86)\Kognat\shared_libraries\tensorflow_cc.dll (icu_62::UCharsTrieBuilder::matchNodesCanHaveValues + 0x1d2299) [0x30e1d89]
=========     Host Frame:C:\Program Files\Common Files\OFX\Plugins\rotobot.ofx.bundle\Contents\win64\rotobot.ofx (drawMasksDP + 0x5c0) [0xa7bf0]
=========     Host Frame:C:\Program Files\Common Files\OFX\Plugins\rotobot.ofx.bundle\Contents\win64\rotobot.ofx (rotobotSegmentationPlugin::computeMask + 0x959) [0xa46a9]
=========     Host Frame:C:\Program Files\Common Files\OFX\Plugins\rotobot.ofx.bundle\Contents\win64\rotobot.ofx (rotobotSegmentationPlugin::setupAndProcess + 0x1c7) [0xafe27]
=========     Host Frame:C:\Program Files\Common Files\OFX\Plugins\rotobot.ofx.bundle\Contents\win64\rotobot.ofx (rotobotSegmentationPlugin::render + 0xc3) [0xaef93]
=========     Host Frame:C:\Program Files\Common Files\OFX\Plugins\rotobot.ofx.bundle\Contents\win64\rotobot.ofx (OFX::Private::renderAction + 0x6c) [0xc495c]
=========     Host Frame:C:\Program Files\Common Files\OFX\Plugins\rotobot.ofx.bundle\Contents\win64\rotobot.ofx (OFX::Private::mainEntryStr + 0xb2a) [0xc2dda]
=========     Host Frame:C:\Program Files\Common Files\OFX\Plugins\rotobot.ofx.bundle\Contents\win64\rotobot.ofx (OFX::FactoryMainEntryHelper<rotobotSegmentationPluginFactory>::mainEntry + 0x45) [0xa9cb5]
=========     Host Frame:C:\Program Files\INRIA\Natron-2.3.14\bin\Natron.exe (initNatronEngine + 0x418122) [0x71f782]
=========     Host Frame:C:\Program Files\INRIA\Natron-2.3.14\bin\Natron.exe (initNatronEngine + 0x3794ce) [0x680b2e]
=========     Host Frame:C:\Program Files\INRIA\Natron-2.3.14\bin\Natron.exe (initNatronEngine + 0x42080b) [0x727e6b]
=========     Host Frame:C:\Program Files\INRIA\Natron-2.3.14\bin\Natron.exe (initNatronEngine + 0x17c8cf) [0x483f2f]
=========     Host Frame:C:\Program Files\INRIA\Natron-2.3.14\bin\Natron.exe (initNatronEngine + 0x284736) [0x58bd96]
=========     Host Frame:C:\Program Files\INRIA\Natron-2.3.14\bin\Natron.exe (initNatronEngine + 0x28e572) [0x595bd2]
=========     Host Frame:C:\Program Files\INRIA\Natron-2.3.14\bin\Natron.exe (initNatronEngine + 0x29158c) [0x598bec]
=========     Host Frame:C:\Program Files\INRIA\Natron-2.3.14\bin\Natron.exe (initNatronEngine + 0x272663) [0x579cc3]
=========     Host Frame:C:\Program Files\INRIA\Natron-2.3.14\bin\Natron.exe (initNatronEngine + 0x277c85) [0x57f2e5]
=========     Host Frame:C:\Program Files\INRIA\Natron-2.3.14\bin\Natron.exe (initNatronEngine + 0x5a577) [0x361bd7]
=========     Host Frame:C:\Program Files\INRIA\Natron-2.3.14\bin\Natron.exe (initNatronEngine + 0x5fe57) [0x3674b7]
=========     Host Frame:C:\Program Files\INRIA\Natron-2.3.14\bin\Natron.exe (ZN5boost7archive6detail11oserializerINS0_15binary_oarchiveESt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaISA_EEEC2Ev + 0xbd821) [0x99b521]
=========     Host Frame:C:\Program Files\INRIA\Natron-2.3.14\bin\QtCore4.dll (ZN17QThreadPoolThread3runEv + 0x18f) [0xf12f]
=========     Host Frame:C:\Program Files\INRIA\Natron-2.3.14\bin\QtCore4.dll (ZN7QThread21setTerminationEnabledEb + 0x268) [0x1a318]
=========     Host Frame:C:\WINDOWS\System32\msvcrt.dll (beginthreadex + 0x126) [0x3aa96]
=========     Host Frame:C:\WINDOWS\System32\msvcrt.dll (endthreadex + 0xac) [0x3ab6c]
=========
========= No CUDA-MEMCHECK results found

samhodge commented Apr 8, 2019

Interestingly, running without cuda-memcheck.exe doesn't result in a crash, but the program is left in a zombie state after closing.

C:\Users\user>"C:\Program Files\INRIA\Natron-2.3.14\bin\Natron.exe"
Natron Version 2.3.14
Copyright (C) 2013-2018 INRIA and Alexandre Gauthier-Foichat
>>>Use the --help or -h option to print usage.<<<
Info: init.py script not loaded (this is not an error)
Info: initGui.py script not loaded (this is not an error)
Rotobot: Model Decrypting Started... Decrypting Ended!
Rotobot: Calculating with the follow CUDA enabled GPU
Rotobot: Device Number: 0
Rotobot:   Device name: GeForce GTX 1060 6GB
Rotobot:   Using VRAM percentage 82.4%
2019-04-08 09:39:09.238191: W tensorflow/stream_executor/cuda/cuda_driver.cc:416] A non-primary context 00000000422B89B0 for device 0 exists before initializing the StreamExecutor. The primary context is now 0000000046202EF0. We haven't verified StreamExecutor works with that.
2019-04-08 09:39:09.261324: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.7085
pciBusID: 0000:0f:00.0
totalMemory: 6.00GiB freeMemory: 4.89GiB
2019-04-08 09:39:09.281762: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-04-08 09:39:10.424100: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-08 09:39:10.435463: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-04-08 09:39:10.443461: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-04-08 09:39:10.453030: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5064 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:0f:00.0, compute capability: 6.1)
Rotobot: Swapping to model: C:\Program Files (x86)\Kognat/rotobot_segmentation.pb using a single model per render is more efficent
2019-04-08 09:39:30.327633: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 4.95G (5310059520 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

Sorry, the process does complete after being in a suspended state for a few seconds.

@samhodge
Author

samhodge commented Apr 8, 2019

I made this for clients to help with debugging on their systems, so I may as well leave it here.

The result will be watermarked.

This might be informative.

I made a video to help you get better debug information

https://www.youtube.com/watch?v=oWULIoJlrto

Download and install Natron
https://sourceforge.net/projects/natron/files/Windows/64/releases/Natron-2.3.14-Windows-x86_64bit-setup.exe/download

so we have a known quantity with OpenFX hosts.

Close all other applications

Then follow the tutorial about how to open Natron from the command prompt

Then playback a few frames and see if it crashes.

The footage used in the clip is here:
http://bit.ly/IMG_6463-MOV

The debug installer is here:
http://bit.ly/Kognat-1-2-0-RC2-cuda10-debug-windows-installer

If or when it does crash, please say what you did before the crash and provide the information from the command prompt: just highlight it and use Ctrl-C and Ctrl-V to put it into a text document.

There are subtitles on the YouTube video; the audio isn't very good.

@samhodge samhodge changed the title Failing on a in tensorflow_cc.so on Windows 7 on Quadro R5000 16Gb with v1.12 and CUDA 10.0.130 and CUDNN 7.4.2.24 OK under Windows 10 Quadro 1060 6Gb Failing on a in tensorflow_cc.so on Windows 7 on Quadro R5000 16Gb with v1.12 and CUDA 10.0.130 and CUDNN 7.4.2.24 OK under Windows 10 Quadro P5000 and GTX 1060 6Gb Apr 9, 2019
@muddham muddham self-assigned this Apr 9, 2019
@muddham muddham added subtype:windows Windows Build/Installation Issues type:build/install Build and install issues comp:runtime c++ runtime, performance issues (cpu) type:bug Bug and removed type:build/install Build and install issues labels Apr 9, 2019
@samhodge
Author

samhodge commented Apr 10, 2019 via email

@muddham muddham assigned jvishnuvardhan and unassigned muddham Apr 16, 2019
@samhodge
Author

Memory management in TF is the greatest cause of bugs in my application; any help would be useful for the entire community.

@jvishnuvardhan
Contributor

@samhodge There are several tutorials listed on the TF website dealing with memory management. Please take a look at them. There are also articles on best practices for TensorFlow Lite; you could search for similar articles on the internet.

This is not a Build/Installation or Bug/Performance issue. Please post this kind of support question on Stack Overflow; there is a big community there to support you and learn from your questions. GitHub is mainly for addressing bugs in installation and performance. Thanks!

@jvishnuvardhan jvishnuvardhan added the stat:awaiting response Status - Awaiting response from author label May 1, 2019
@samhodge
Author

samhodge commented May 1, 2019

The only relevant article is https://www.tensorflow.org/guide/using_gpu#allowing_gpu_memory_growth, which is not in the tutorials.

I have set the value to 0.95 of the memory available for allocation, measured by querying the CUDA device API.

When other applications get into this GPU memory space, it results in a segfault.

This can be done easily by opening a browser or similar.

This makes my customers upset.

There is no way to deallocate TF memory apart from session->reset(), which doesn't actually work.

see
#20387
#1578
#5302

If memory management is so well documented, why are capable C++ coders having issues with it?

@samhodge
Author

samhodge commented May 1, 2019

The use case is as follows: several models are run in the one C++ software application.

To create the next session, a singleton is used for the session and a new model is allocated, but only the first allocation sets aside the VRAM to be used.

So if 5.2 GB of VRAM is allocated, one model is running, and no other application uses that VRAM, everything is OK.

Then you switch to a new model: there is some memory fragmentation, and other applications on the host machine allocate some VRAM (watching a video on YouTube, for instance) while waiting for the TF model to execute.

Then the application switches to another model and allocates a new session; the memory allocated by TF no longer covers all of the 5.2 GB it originally had, and you end up with an OOM condition.

Where is the tutorial about this use case?

@samhodge
Author

samhodge commented May 1, 2019

@jvishnuvardhan Thank you for the articles on quantization. I am looking into this, and I am also looking into using TFLite on Linux and macOS; I am not sure how useful it is on Windows.

@samhodge
Author

samhodge commented May 1, 2019

@jvishnuvardhan as for using Stack Overflow

See https://stackoverflow.com/questions/52683649/libtensorflow-cc-so-initialised-a-second-time-causes-segfault. This cannot be fixed by anybody but the TF devs, but it was ignored; as a result I cannot get my OFX plugin to run in Autodesk Flame 2020, which would be a sizable product user base. I reported it in October 2018; Autodesk Flame 2020 was released in April 2019. There was no useful response from the Stack Overflow community or the TF devs.

@samhodge
Author

samhodge commented May 1, 2019

@jvishnuvardhan

here is the TF team's response #22810

@samhodge
Author

samhodge commented May 2, 2019

@jvishnuvardhan
Contributor

@samhodge Thanks for sharing the resource. Thanks!

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label May 3, 2019
@samhodge
Author

samhodge commented May 3, 2019

Here is another error report
```
Rotobot: tfSession->Run failed: Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node xception_65/entry_flow/conv1_1/Conv2D}} = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 2, 2], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](xception_65/entry_flow/conv1_1/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer, xception_65/entry_flow/conv1_1/weights)]]
[[{{node SemanticPredictions/_45}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2428_SemanticPredictions", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
```

@jvishnuvardhan jvishnuvardhan added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label May 17, 2019
@ZhuoranLyu

@samhodge Exactly the same situation, except I am using an RTX 2080 Ti under Windows 10. However, if I set options.config.mutable_gpu_options()->set_per_process_gpu_memory_fraction(fraction); in my code, it gives me the error

Error	LNK2019	unresolved external symbol "private: static class tensorflow::GPUOptions * __cdecl google::protobuf::Arena::CreateMaybeMessage(class google::protobuf::Arena *)" (??$CreateMaybeMessage@VGPUOptions@tensorflow@@$$V@Arena@protobuf@google@@CAPEAVGPUOptions@tensorflow@@PEAV012@@z) referenced in function "protected: static class tensorflow::GPUOptions * __cdecl google::protobuf::MessageLite::CreateMaybeMessage(class google::protobuf::Arena *)" (??$CreateMaybeMessage@VGPUOptions@tensorflow@@@MessageLite@protobuf@google@@KAPEAVGPUOptions@tensorflow@@PEAVArena@12@@z)	tftest

Any idea on this? I compiled from source.

@samhodge
Author

Expose those symbols in the script that lists all the symbols you are exposing.

What project are you working towards?

@ZhuoranLyu

@samhodge Hi Sam, I just replied to you in another issue. Thank you for your reply.

@sushreebarsa
Contributor

@samhodge We see that you are using an old version of TensorFlow which is officially considered end of life. We recommend that you upgrade to 2.4 or a later version and let us know if the issue still persists in newer versions. Please open a new issue in case you face any errors, and we will get you the right help. Hence moving this to closed status. Thanks!

