Failing on a in tensorflow_cc.so on Windows 7 on Quadro R5000 16 GB with v1.12, CUDA 10.0.130 and cuDNN 7.4.2.24; OK under Windows 10 on Quadro P5000 and GTX 1060 6 GB #27441
Comments
OK, walking backwards through the call stack I can see where this could have been raised from. Actually, a grep is easier.
I am surprised I am not getting any error messages out of https://github.com/tensorflow/tensorflow/blob/v1.12.0/tensorflow/stream_executor/cuda/cuda_driver.cc The verbosity of the application is turned down to errors only; maybe I can supply the client with a build with maximum verbosity and see if that helps trace the error.
OK, my action plan for now is to build a version with maximum debug output and see what is happening and what is not happening. Thanks for listening. Sam
I got a new error report. With error level 3 the only message is: tfSession->Run failed: Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
Everything points to a runtime environment problem: #24828 (comment) Is it possible to write a diagnostic tool which prints the paths and versions of cuDNN and CUDA in the client's runtime environment? Sam
It all leads back to cudnn64_7.dll not having an overly specific name; see https://en.wikipedia.org/wiki/DLL_Hell
Could be related to this: #24496
The suggested solution on that ticket is to allow growth: https://www.tensorflow.org/guide/using_gpu#allowing_gpu_memory_growth But this is a performance decision.
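The allow-growth option from that guide can be set from the C++ API roughly like this (a sketch of the TF 1.x SessionOptions setup, not the application's actual code):

```cpp
tensorflow::SessionOptions options;
// Grow GPU allocations on demand instead of reserving the whole pool
// up front; trades some allocation performance for a smaller footprint.
options.config.mutable_gpu_options()->set_allow_growth(true);
std::unique_ptr<tensorflow::Session> session(tensorflow::NewSession(options));
```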
Just confirmed this code:
This results in a properly calculated graph.
Here is the log without allow growth set:
With allow growth enabled, it uses shared GPU memory as well as dedicated GPU memory.
Looks like this tool could be useful if I can instruct the clients in how to use it: http://docs.nvidia.com/cuda/cuda-memcheck/index.html
With allow growth turned on there is nothing insightful from cuda-memcheck.
Without memory growth on, I am finally able to get a trace from Natron.
Interestingly, running without cuda-memcheck.exe doesn't result in a crash, but the program is in a zombie state after closing.
Sorry, the process does complete after being in a suspended state for a few seconds.
I made this for the clients to help with debugging on their systems, so I may as well leave it here. The result will be watermarked; this might be informative. I made a video to help you get better debug information: https://www.youtube.com/watch?v=oWULIoJlrto
1. Download and install Natron so we have a known quantity as the OpenFX host.
2. Close all other applications.
3. Follow the tutorial on how to open Natron from the command prompt.
4. Play back a few frames and see if it crashes.
The footage used in the clip is here: The debug installer is here: If or when it does crash, please say what you did before it crashed and provide the information from the command prompt; just highlight it and use Ctrl-C and Ctrl-V to put it into a text document. There are subtitles on the YouTube video; the audio isn't very good.
The trace says, just before it, that allow growth was turned off.
I cannot see your comment on GitHub.
On Thu, Apr 11, 2019 at 6:19 AM Daniel Bryce Evans wrote:
@samhodge You were able to get a trace from cuda-memcheck with allow_growth on or off?
Memory management in TF is the greatest cause of bugs in my application; any help would be useful for the entire community.
@samhodge There are several tutorials listed on the TF website that deal with memory management. Please take a look at them. There are also articles on best practices for TensorFlow Lite; you could search for similar articles on the internet. This is not a build/installation or bug/performance issue. Please post this kind of support question on Stack Overflow, where there is a big community to support you and learn from your questions. GitHub is mainly for addressing bugs in installation and performance. Thanks!
The only relevant article is this one, which is not in the tutorials: https://www.tensorflow.org/guide/using_gpu#allowing_gpu_memory_growth I have set the value to 0.95 of the memory available for allocation, measured by testing with the CUDA device API. When other applications get into this GPU memory space it results in a segfault. This can be triggered easily by opening a browser or similar, and it makes my customers upset. There is no way to deallocate TF memory apart from session->reset(), which doesn't actually work. If memory management is so well documented, why are capable C++ coders having issues with it?
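For reference, a sketch (assuming the TF 1.x C++ API, and not claiming this is a working fix) of the teardown calls that do exist; as noted above, neither actually returns the pooled GPU memory to the system:

```cpp
// Close the session, then reset its resource containers. The GPU
// allocator's pool lives for the whole process, so VRAM is not freed.
session->Close();
tensorflow::Reset(options, {});  // clears resource containers only
```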
The use case is as follows: there are several models that are run in the one C++ software application. To create the next session, a singleton is used for the session and a new model is allocated, but only the first allocation will set aside the VRAM to be used. So if 5.2 GB of VRAM is allocated, one model is running, and no other application uses that VRAM, everything is OK. Then you switch to a new model; there is some memory fragmentation, and other applications on the host machine allocate some VRAM (watching a video on YouTube, for instance) while waiting for the TF model to execute. When the application then switches to another model and allocates a new session, the memory allocated by TF no longer has access to all of the 5.2 GB it originally had, and you end up with an OOM condition. Where is the tutorial about this use case?
@jvishnuvardhan Thank you for the articles on quantization. I am looking into this, and I am also looking into the use of TFLite on Linux and macOS; I am not sure how useful it is on Windows.
@jvishnuvardhan As for using Stack Overflow, see this: https://stackoverflow.com/questions/52683649/libtensorflow-cc-so-initialised-a-second-time-causes-segfault This cannot be fixed by anybody but the TF devs. It was ignored, and as a result I cannot get my OFX plugin to run in Autodesk Flame 2020, which would be a sizable product user base. I reported it in October 2018; Autodesk Flame 2020 was released in April 2019. There was no useful response from the Stack Overflow community or the TF devs.
Here is the TF team's response: #22810
This is the most detailed reference: https://github.com/miglopst/cs263_spring2018/wiki/Memory-management-for-tensorflow
@samhodge Thanks for sharing the resource. Thanks! |
Here is another error report:
@samhodge Exactly the same situation, except that I am using an RTX 2080 Ti under Windows 10. However, if I set this
Any idea on this? I compiled from source. |
Did you expose those symbols in your script of all the symbols you are exposing? What project are you working towards?
@samhodge Hi Sam, I just replied to you in another issue. Thank you for your reply.
@samhodge We see that you are using an old version of TensorFlow which is officially considered end of life. We recommend that you upgrade to 2.4 or a later version and let us know if the issue still persists in newer versions. Please open a new issue in case you face any errors and we will get you the right help. Hence, moving this to closed status. Thanks!
Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template
System information
I have linked against the //tensorflow:libtensorflow_cc.so and //tensorflow:libtensorflow_framework.so targets, along with other libs: abseil-cpp, libprotobuf, etc.
Windows 10 (build) and Windows 7 (deployment)
(tensorflow-cuda10) C:\Users\user\dev\tensorflow-cuda10\tensorflow\tensorflow\core\common_runtime\gpu>python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
b'v1.12.0-0-ga6d8ffae09' 1.12.0
You can collect some of this information using our environment capture script
You can also obtain the TensorFlow version with
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
Describe the current behavior
The application currently crashes when initialising the session on the Quadro card on the client's computer running Windows 7, with the error message:
2019-04-02 11:30:18.871580: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:274] Unexpected Event status: 1
Here is the code for that file, where the LOG(FATAL) is line 274
Describe the expected behavior
I would expect the software to load the graph into a fresh session and compute
Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.
tensorflow::SessionOptions options;
tensorflow::ConfigProto* config = &options.config;
config->mutable_gpu_options()->set_per_process_gpu_memory_fraction(0.9);
auto* device_count = config->mutable_device_count();
device_count->insert({ "GPU", 1 });
device_count->insert({ "CPU", 1 });
// bytes is read from graph_file_name
graph_def->ParseFromArray(bytes.data(), (int)bytes.size());
session->reset(tensorflow::NewSession(options));
std::cout << "Rotobot: Swapping to model: " << graph_file_name
          << " using a single model per render is more efficient" << std::endl;
// crashes after here
auto status = (*session)->Create(*graph_def);
auto status2 = (*session)->Run(Input_Tensors);
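The snippet above discards both Status values. A minimal sketch (assuming the TF 1.x C++ API) of checking them, which would surface a cuDNN initialisation failure as a readable message rather than a later crash:

```cpp
// Check each tensorflow::Status instead of discarding it, so a failed
// Create or Run reports its error string instead of failing silently.
tensorflow::Status status = (*session)->Create(*graph_def);
if (!status.ok()) {
    std::cerr << "Create failed: " << status.ToString() << std::endl;
    return;
}
```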
Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
You can download the built software from:
https://kognat.com/product/rotobot-openfx-plugin-windows-64-gpu-v1-2-0-rc2-cuda-10/
You will just need an OpenFX host like Natron
https://natrongithub.github.io/
This tutorial will give you reproduction steps
https://kognat.com/2019/03/28/rotobot-srgb/