
Segmentation Fault with TensorRT create_inference_graph #27100

Closed
isra60 opened this issue Mar 25, 2019 · 7 comments
Assignees
Labels
comp:apis Highlevel API related issues contrib Anything that comes under contrib directory type:bug Bug

Comments

@isra60

isra60 commented Mar 25, 2019

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):

  • OS Platform and Distribution: Linux Ubuntu 18.04

  • TensorFlow installed from (source or binary): binary (tensorflow-gpu)

  • TensorFlow version (use command below): b'v1.13.1-0-g6612da8951' 1.13.1

  • Python version: Python 3.6.7

  • CUDA/cuDNN version: CUDA 10

  • GPU model and memory: NVIDIA GTX 1060 6GB

Describe the current behavior
I'm trying to optimize a TensorFlow model with TensorRT, using the object detection example from https://github.com/tensorflow/tensorrt/tree/master/tftrt/examples/object_detection. The TensorFlow model loads fine, but a segmentation fault is raised when I try to optimize it.

Describe the expected behavior

Code to reproduce the issue
with tf.Graph().as_default() as tf_graph:
    with tf.Session(config=tf_config) as tf_sess:
        frozen_graph = trt.create_inference_graph(
            input_graph_def=frozen_graph,
            outputs=output_names,
            max_batch_size=max_batch_size,
            max_workspace_size_bytes=max_workspace_size_bytes,
            precision_mode=precision_mode,
            minimum_segment_size=minimum_segment_size,
            is_dynamic_op=True,
            maximum_cached_engines=maximum_cached_engines)

So the segmentation fault occurs inside trt.create_inference_graph.
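For context, the flow above can be expanded into a self-contained sketch. The output names and conversion parameters here are placeholder assumptions taken from the object-detection example, and the import is guarded so the snippet degrades gracefully where TF 1.13's contrib TF-TRT is not installed:

```python
# Hedged sketch of the TF 1.13 reproduction flow (parameter values are
# assumptions). The import is guarded: on TF 2.x or without TensorFlow,
# tensorflow.contrib is absent and the converter simply returns None.
try:
    import tensorflow as tf
    from tensorflow.contrib import tensorrt as trt
    TRT_AVAILABLE = True
except ImportError:
    TRT_AVAILABLE = False


def optimize_frozen_graph(frozen_graph, output_names,
                          max_batch_size=1,
                          max_workspace_size_bytes=1 << 30,
                          precision_mode="FP16",
                          minimum_segment_size=3,
                          maximum_cached_engines=1):
    """Run TF-TRT conversion; returns None when contrib TF-TRT is missing."""
    if not TRT_AVAILABLE:
        return None
    tf_config = tf.ConfigProto()
    tf_config.gpu_options.allow_growth = True  # avoid grabbing all GPU memory
    with tf.Graph().as_default():
        with tf.Session(config=tf_config):
            return trt.create_inference_graph(
                input_graph_def=frozen_graph,
                outputs=output_names,
                max_batch_size=max_batch_size,
                max_workspace_size_bytes=max_workspace_size_bytes,
                precision_mode=precision_mode,
                minimum_segment_size=minimum_segment_size,
                is_dynamic_op=True,
                maximum_cached_engines=maximum_cached_engines)
```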
Other info / logs

This is the log from python output

2019-03-25 09:08:39.360172: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-03-25 09:08:39.360201: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-25 09:08:39.360207: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-03-25 09:08:39.360210: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-03-25 09:08:39.360303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5171 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Running against TensorRT version 5.0.2
INFO:tensorflow:Running against TensorRT version 5.0.2
2019-03-25 09:08:40.787773: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 1
2019-03-25 09:08:40.788522: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
2019-03-25 09:08:40.790765: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-03-25 09:08:40.790785: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-25 09:08:40.790790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-03-25 09:08:40.790793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-03-25 09:08:40.790903: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5171 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1)
2019-03-25 09:08:42.079562: I tensorflow/contrib/tensorrt/segment/segment.cc:443] There are 2316 ops of 32 different types in the graph that are not converted to TensorRT: Fill, Switch, Range, TopKV2, ConcatV2, Identity, Squeeze, Transpose, Const, Unpack, ResizeBilinear, Reshape, Mul, Slice, Merge, Split, Where, ExpandDims, NonMaxSuppressionV3, GatherV2, Cast, Greater, Minimum, Sub, ZerosLike, Pack, Exp, Placeholder, Add, Shape, NoOp, StridedSlice, (For more information see https://docs.nvidia.com/deeplearning/dgx/integrate-tf-trt/index.html#support-ops).
2019-03-25 09:08:42.206925: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:913] Number of TensorRT candidate segments: 185
2019-03-25 09:08:47.654116: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:1015] TensorRT node TRTEngineOp_0 added for segment 0 consisting of 486 nodes succeeded.
Segmentation fault (core dumped)

And this is the call stack from gdb.

2019-03-25 09:12:23.651268: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:1015] TensorRT node TRTEngineOp_0 added for segment 0 consisting of 486 nodes succeeded.

Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
0x00007fff68d60261 in tensorflow::tensorrt::convert::GetDeviceAndAllocator(tensorflow::tensorrt::convert::ConversionParams const&, tensorflow::tensorrt::convert::EngineInfo const&) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/contrib/tensorrt/_wrap_conversion.so
(gdb) bt
#0 0x00007fff68d60261 in tensorflow::tensorrt::convert::GetDeviceAndAllocator(tensorflow::tensorrt::convert::ConversionParams const&, tensorflow::tensorrt::convert::EngineInfo const&) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/contrib/tensorrt/_wrap_conversion.so
#1 0x00007fff68d651aa in tensorflow::tensorrt::convert::ConvertAfterShapes(tensorflow::tensorrt::convert::ConversionParams&) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/contrib/tensorrt/_wrap_conversion.so
#2 0x00007fff68d90f56 in tensorflow::tensorrt::convert::TRTOptimizationPass::Optimize(tensorflow::grappler::Cluster*, tensorflow::grappler::GrapplerItem const&, tensorflow::GraphDef*) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/contrib/tensorrt/_wrap_conversion.so
#3 0x00007fffb549a8ee in tensorflow::grappler::MetaOptimizer::RunOptimizer(tensorflow::grappler::GraphOptimizer*, tensorflow::grappler::Cluster*, tensorflow::grappler::GrapplerItem*, tensorflow::GraphDef*, tensorflow::grappler::MetaOptimizer::GraphOptimizationResult*) () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#4 0x00007fffb549b552 in tensorflow::grappler::MetaOptimizer::OptimizeGraph(tensorflow::grappler::Cluster*, tensorflow::grappler::GrapplerItem const&, tensorflow::GraphDef*) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#5 0x00007fffb549c8a7 in tensorflow::grappler::MetaOptimizer::Optimize(tensorflow::grappler::Cluster*, tensorflow::grappler::GrapplerItem const&, tensorflow::GraphDef*) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#6 0x00007fffb028ab9c in TF_OptimizeGraph(GCluster, tensorflow::ConfigProto const&, tensorflow::MetaGraphDef const&, bool, std::string const&, TF_Status*) ()
from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#7 0x00007fffb0293157 in _wrap_TF_OptimizeGraph () from /usr/local/lib/python3.6/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#8 0x0000000000502d6f in ?? ()
#9 0x0000000000506859 in _PyEval_EvalFrameDefault ()
#10 0x0000000000504c28 in ?? ()
#11 0x0000000000502540 in ?? ()
#12 0x0000000000502f3d in ?? ()
#13 0x0000000000507641 in _PyEval_EvalFrameDefault ()
#14 0x0000000000504c28 in ?? ()
#15 0x0000000000502540 in ?? ()
#16 0x0000000000502f3d in ?? ()
#17 0x0000000000507641 in _PyEval_EvalFrameDefault ()
#18 0x0000000000504c28 in ?? ()
#19 0x0000000000502540 in ?? ()
#20 0x0000000000502f3d in ?? ()
#21 0x0000000000507641 in _PyEval_EvalFrameDefault ()
#22 0x0000000000504c28 in ?? ()
#23 0x0000000000506393 in PyEval_EvalCode ()
#24 0x0000000000634d52 in ?? ()
#25 0x00000000004a38c5 in ?? ()
#26 0x00000000004a5cd5 in PyRun_InteractiveLoopFlags ()
#27 0x00000000006387b3 in PyRun_AnyFileExFlags ()
#28 0x000000000063915a in Py_Main ()
#29 0x00000000004a6f10 in main ()

@jvishnuvardhan jvishnuvardhan self-assigned this Mar 28, 2019
@jvishnuvardhan jvishnuvardhan added contrib Anything that comes under contrib directory comp:apis Highlevel API related issues type:bug Bug labels Mar 28, 2019
@jvishnuvardhan jvishnuvardhan added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Mar 28, 2019
@aaroey
Member

aaroey commented May 6, 2019

@isra60 I was not able to reproduce. Could you add more details about how you set up the tensorflow/tensorrt repository locally? Thanks.

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label May 7, 2019
@huaifeng1993

huaifeng1993 commented Jul 16, 2019

@aaroey I have the same problem. Have you solved it?

@huaifeng1993

huaifeng1993 commented Jul 16, 2019

Here is my information.

System information

  • OS Platform and Distribution: Linux Ubuntu 16.04

  • TensorFlow installed from source: git clone from GitHub

  • TensorFlow version : tensorflow 1.13.0rc0

  • Python version: Python 3.5.6

  • CUDA/cuDNN version: CUDA 10.0.130 cuDNN V7.3.1

  • GPU model and memory: NVIDIA GTX TITAN XP

TensorRT 5.0.2

@aaroey
Member

aaroey commented Aug 14, 2019

@isra60 @huaifeng1993 could you try tf-nightly-gpu and see whether the issue still reproduces?

@suneeta-mall

@aaroey I have a similar issue with the same stack on TF 2.1, but it works with TF 2.2 dev builds. Do you know of a change that might have gone in to address that?

@aaroey aaroey assigned bixia1 and unassigned aaroey Apr 15, 2020
@tensorflowbutler
Member

Hi There,

We are checking to see if you still need help with this issue, as you are using an older version of TensorFlow (1.x), which has officially reached end of life. We recommend upgrading to 2.4 or a later version and letting us know if the issue persists in newer versions.

This issue will be closed automatically 7 days from now. If you still need help with this issue, please open a new issue against 2.x, and we will get you the right help.
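For anyone following the upgrade advice, here is a hedged sketch of the equivalent conversion on TF 2.4+, where contrib is gone and TF-TRT lives under tensorflow.python.compiler.tensorrt. The SavedModel directories are placeholders, and the import is guarded so the snippet runs even where TF is not installed:

```python
# Hedged sketch of TF-TRT conversion on TF 2.4+ (SavedModel paths are
# placeholders). The guarded import makes this a no-op without TF-TRT.
try:
    from tensorflow.python.compiler.tensorrt import trt_convert as trt
    TF2_TRT_AVAILABLE = True
except ImportError:
    TF2_TRT_AVAILABLE = False


def convert_saved_model(input_dir, output_dir):
    """Convert a TF 2.x SavedModel with TF-TRT; returns False if unavailable."""
    if not TF2_TRT_AVAILABLE:
        return False
    params = trt.TrtConversionParams(precision_mode="FP16")
    converter = trt.TrtGraphConverterV2(
        input_saved_model_dir=input_dir,
        conversion_params=params)
    converter.convert()          # replaces TRT-compatible segments with engine ops
    converter.save(output_dir)   # writes the converted SavedModel
    return True
```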


Projects
None yet
Development

No branches or pull requests

7 participants