Get deadlock after Predict(cuda10.0, cudnn7.6.5, Tesla T4 GPU) #60841

Closed
ivankxt opened this issue Jun 12, 2023 · 8 comments

Comments

ivankxt commented Jun 12, 2023

Issue Type

Bug

Have you reproduced the bug with TF nightly?

Yes

Source

source

Tensorflow Version

TF 2.2 + TF Serving 2.2

Custom Code

Yes

OS Platform and Distribution

CentOS 7

Mobile device

No response

Python version

3.6

Bazel version

3.7.2

GCC/Compiler version

7.5

CUDA/cuDNN version

CUDA 10.0 / cuDNN 7.6.5

GPU model and memory

Tesla T4, 15 GB

Current Behaviour?

In our inference service, the call to the predict interface (predictor_->Predict(...)) deadlocks:

std::unique_ptr<tensorflow::serving::TensorflowPredictor> predictor_;
predictor_->Predict(opt, core_.get(), predict_req, &predict_resp, run_metadata.get());

Here is the pstack:

The thread is clearly inside the asynchronous execution path, waiting on a notification:

Thread 167 (Thread 0x7f4b1bfa7700 (LWP 81084)):
#0 0x00007f5059939c09 in syscall () from /usr/lib64/libc.so.6
#1 0x00007f505feb1bbb in nsync::nsync_mu_semaphore_p_with_deadline(nsync::nsync_semaphore_s_*, timespec) () from /home/qspace/upload/libtensorflow_serving.so
#2 0x00007f505feaedf9 in nsync::nsync_sem_wait_with_cancel_(nsync::waiter*, timespec, nsync::nsync_note_s_*) () from /home/qspace/upload/libtensorflow_serving.so
#3 0x00007f505feafeeb in nsync::nsync_cv_wait_with_deadline_generic(nsync::nsync_cv_s_*, void*, void (*)(void*), void (*)(void*), timespec, nsync::nsync_note_s_*) () from /home/qspace/upload/libtensorflow_serving.so
#4 0x00007f505feb03c3 in nsync::nsync_cv_wait_with_deadline(nsync::nsync_cv_s_*, nsync::nsync_mu_s_*, timespec, nsync::nsync_note_s_*) () from /home/qspace/upload/libtensorflow_serving.so
#5 0x00007f506168349c in tensorflow::DirectSession::WaitForNotification(tensorflow::Notification*, long long) () from /home/qspace/upload/libtensorflow_serving.so
#6 0x00007f50616834ed in tensorflow::DirectSession::WaitForNotification(tensorflow::Notification*, tensorflow::DirectSession::RunState*, tensorflow::CancellationManager*, long long) () from /home/qspace/upload/libtensorflow_serving.so
#7 0x00007f5061693bb5 in tensorflow::DirectSession::RunInternal(long long, tensorflow::RunOptions const&, tensorflow::CallFrameInterface*, tensorflow::DirectSession::ExecutorsAndKeys*, tensorflow::RunMetadata*, tensorflow::thread::ThreadPoolOptions const&) () from /home/qspace/upload/libtensorflow_serving.so
#8 0x00007f5061695dd5 in tensorflow::DirectSession::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*, tensorflow::thread::ThreadPoolOptions const&) () from /home/qspace/upload/libtensorflow_serving.so
#9 0x00007f5061681313 in tensorflow::DirectSession::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*) () from /home/qspace/upload/libtensorflow_serving.so
#10 0x00007f5067131cdc in tensorflow::serving::ServingSessionWrapper::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*) () from /home/qspace/upload/libtensorflow_serving.so
#11 0x00007f506714092b in tensorflow::serving::internal::RunPredict(tensorflow::RunOptions const&, tensorflow::MetaGraphDef const&, tensorflow::serving::optional const&, tensorflow::serving::internal::PredictResponseTensorSerializationOption, tensorflow::Session*, tensorflow::serving::PredictRequest const&, tensorflow::serving::PredictResponse*, tensorflow::RunMetadata*) () from /home/qspace/upload/libtensorflow_serving.so
#12 0x00007f5067131aa0 in tensorflow::serving::TensorflowPredictor::PredictWithModelSpec(tensorflow::RunOptions const&, tensorflow::serving::ServerCore*, tensorflow::serving::ModelSpec const&, tensorflow::serving::PredictRequest const&, tensorflow::serving::PredictResponse*, tensorflow::RunMetadata*) () from /home/qspace/upload/libtensorflow_serving.so
#13 0x00007f5067131c81 in tensorflow::serving::TensorflowPredictor::Predict(tensorflow::RunOptions const&, tensorflow::serving::ServerCore*, tensorflow::serving::PredictRequest const&, tensorflow::serving::PredictResponse*, tensorflow::RunMetadata*) () from /home/qspace/upload/libtensorflow_serving.so
#14 0x0000000002bd91b1 in mmfinderbd::RankModel::Predict (this=0x327bf080, req=..., resp=0x7f47042735d0) at bdegateway/mmfinder/mmfinderbdetfsvr/models/tf/rank_tf_model.cpp:525
#15 0x0000000002b838cc in mmfinderbd::ServerCoreSingleModel::Predict (this=0x2c176aa0 <mmfinderbd::ServerCore::Instance()::instance>, req=..., resp=...) at bdegateway/mmfinder/mmfinderbdetfsvr/core/server_core_single_model.cpp:365
#16 0x0000000002b5df8b in MMFinderBdeTfSvrServiceImpl_PB::InferImpl (this=0x7f4ad84cbec0, head_uin=<optimized out>, req=..., resp=0x7f47042735d0) at bdegateway/mmfinder/mmfinderbdetfsvr/mmfinderbdetfsvrserviceimpl_pb.cpp:93
#17 0x0000000002b731d8 in MMFinderBdeTfSvrDispatcher_PB::Infer (this=this@entry=0x7f4ad84cbe60, uin=<optimized out>, req_buffer=req_buffer@entry=0x7f4ad84cb868, resp_buffer=resp_buffer@entry=0x7f4ad84cb870) at bazel-out/cd7t-opt/genfiles/bdegateway/mmfinder/mmfinderbdetfsvr/skgenerated/sk_mmfinderbdetfsvrdispatcher.pb.cpp:1366
#18 0x0000000002b78569 in MMFinderBdeTfSvrDispatcher_PB::Dispatch (this=this@entry=0x7f4ad84cbe60) at bazel-out/cd7t-opt/genfiles/bdegateway/mmfinder/mmfinderbdetfsvr/skgenerated/sk_mmfinderbdetfsvrdispatcher.pb.cpp:398
#19 0x0000000002b5a91e in MMFinderBdeTfSvrServer::SKServerProc (this=<optimized out>, ctrl_info=0x7f470421b820, ctx=0x7f470421b7a0, in_pkg=0x7f42681054a0, out_pkg=0x7f42681054e0, args=<optimized out>) at ./bdegateway/mmfinder/mmfinderbdetfsvr/mmfinderbdetfsvrserver.h:44
#20 0x000000000671a6c8 in SMCoWorkerMt::CoWorkerIORun (this=0x330b4670, self=0x7f470421b6d0) at comm2/summer/smcoworker.cpp:1138
#21 0x0000000007cc679e in operator() (this=0x7f470421b908) at /home/mmdev/gcc7/lib/gcc/x86_64-pc-linux-gnu/7.5.0/../../../../include/c++/7.5.0/bits/std_function.h:706
#22 CoRoutineFunc (co=0x7f470421b8f0) at basic/colib/co_routine.cpp:601
#23 0x0000000000000000 in ?? ()

And here is what it is actually waiting for:

Thread 301 (Thread 0x7f4b09dcf700 (LWP 80869)):
#0 0x00007f5059939c09 in syscall () from /usr/lib64/libc.so.6
#1 0x0000000005bf5111 in WaitUntil (t=..., val=0, v=0x7f45f046b750) at mm3rd/abseil-cpp/absl/synchronization/internal/waiter.cc:107
#2 absl::lts_2020_02_25::synchronization_internal::Waiter::Wait (this=this@entry=0x7f45f046b750, t=t@entry=...) at mm3rd/abseil-cpp/absl/synchronization/internal/waiter.cc:151
#3 0x0000000005bf5052 in AbslInternalPerThreadSemWait (t=...) at mm3rd/abseil-cpp/absl/synchronization/internal/per_thread_sem.cc:93
#4 0x00007f5066f87b6d in absl::Mutex::Block(absl::base_internal::PerThreadSynch*) () from /home/qspace/upload/libtensorflow_serving.so
#5 0x00007f5066f8889e in absl::Mutex::LockSlowLoop(absl::SynchWaitParams*, int) () from /home/qspace/upload/libtensorflow_serving.so
#6 0x00007f5066f88dc2 in absl::Mutex::LockSlowWithDeadline(absl::MuHowS const*, absl::Condition const*, absl::synchronization_internal::KernelTimeout, int) () from /home/qspace/upload/libtensorflow_serving.so
#7 0x00007f505f0897ec in absl::Mutex::LockSlow(absl::MuHowS const*, absl::Condition const*, int) () from /home/qspace/upload/libtensorflow_serving.so
#8 0x00007f50601c56fe in stream_executor::gpu::CUDABlas::DoBlasGemm(stream_executor::Stream*, stream_executor::blas::Transpose, stream_executor::blas::Transpose, unsigned long long, unsigned long long, unsigned long long, float, stream_executor::DeviceMemory<float> const&, int, stream_executor::DeviceMemory<float> const&, int, float, stream_executor::DeviceMemory<float>*, int) () from /home/qspace/upload/libtensorflow_serving.so
#9 0x00007f5060290c13 in stream_executor::Stream::ThenBlasGemm(stream_executor::blas::Transpose, stream_executor::blas::Transpose, unsigned long long, unsigned long long, unsigned long long, float, stream_executor::DeviceMemory<float> const&, int, stream_executor::DeviceMemory<float> const&, int, float, stream_executor::DeviceMemory<float>*, int) () from /home/qspace/upload/libtensorflow_serving.so
#10 0x00007f5063aebc4f in tensorflow::LaunchMatMul<Eigen::GpuDevice, float, true>::launch(tensorflow::OpKernelContext*, tensorflow::Tensor const&, tensorflow::Tensor const&, Eigen::array<Eigen::IndexPair<Eigen::DenseIndex>, 1ul> const&, std::vector<long long, std::allocator<long long> >*, bool, tensorflow::Tensor*) () from /home/qspace/upload/libtensorflow_serving.so
#11 0x00007f5063aec42d in tensorflow::MatMulOp<Eigen::GpuDevice, float, true>::Compute(tensorflow::OpKernelContext*) () from /home/qspace/upload/libtensorflow_serving.so
#12 0x00007f5061878296 in tensorflow::BaseGPUDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) () from /home/qspace/upload/libtensorflow_serving.so
#13 0x00007f50614a80bf in tensorflow::(anonymous namespace)::ExecutorState::Process(tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long) () from /home/qspace/upload/libtensorflow_serving.so
#14 0x00007f50614a8c7f in std::_Function_handler<void (), tensorflow::(anonymous namespace)::ExecutorState::ScheduleReady(absl::InlinedVector<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, 8ul, std::allocator<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode> > const&, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*)::{lambda()#2}>::_M_invoke(std::_Any_data const&) () from /home/qspace/upload/libtensorflow_serving.so
#15 0x00007f506189a71f in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::ScheduleWithHint(std::function<void ()>, int, int) () from /home/qspace/upload/libtensorflow_serving.so
#16 0x00007f506189dd1b in tensorflow::thread::ThreadPool::Schedule(std::function<void ()>) () from /home/qspace/upload/libtensorflow_serving.so
#17 0x00007f5061681bb3 in std::_Function_handler<void (std::function<void ()>), tensorflow::DirectSession::RunInternal(long long, tensorflow::RunOptions const&, tensorflow::CallFrameInterface*, tensorflow::DirectSession::ExecutorsAndKeys*, tensorflow::RunMetadata*, tensorflow::thread::ThreadPoolOptions const&)::{lambda(tensorflow::DirectSession::PerPartitionExecutorsAndLib const&, tensorflow::Executor::Args*)#7}::operator()(tensorflow::DirectSession::PerPartitionExecutorsAndLib const&, tensorflow::Executor::Args*) const::{lambda(std::function<void ()>)#1}>::_M_invoke(std::_Any_data const&, std::function<void ()>&&) () from /home/qspace/upload/libtensorflow_serving.so
#18 0x00007f506149ac84 in tensorflow::(anonymous namespace)::ExecutorState::ScheduleReady(absl::InlinedVector<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, 8ul, std::allocator<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode> > const&, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*) [clone .part.508] () from /home/qspace/upload/libtensorflow_serving.so
#19 0x00007f50614a3a84 in tensorflow::(anonymous namespace)::ExecutorState::NodeDone(tensorflow::Status const&, absl::InlinedVector<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, 8ul, std::allocator<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode> > const&, tensorflow::NodeExecStatsInterface*, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*) () from /home/qspace/upload/libtensorflow_serving.so
#20 0x00007f50614a8e4f in tensorflow::(anonymous namespace)::ExecutorState::Process(tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long)::{lambda()#6}::operator()() const () from /home/qspace/upload/libtensorflow_serving.so
#21 0x00007f50614fcbd0 in std::_Function_handler<void (tensorflow::Status const&), tensorflow::(anonymous namespace)::IntraProcessRecvAsyncImpl(tensorflow::DeviceMgr const*, tensorflow::LocalRendezvous*, tensorflow::RendezvousInterface::ParsedKey const&, tensorflow::RendezvousInterface::Args const&, std::function<void (tensorflow::Status const&, tensorflow::RendezvousInterface::Args const&, tensorflow::RendezvousInterface::Args const&, tensorflow::Tensor const&, bool)>)::{lambda(tensorflow::Status const&, tensorflow::RendezvousInterface::Args const&, tensorflow::RendezvousInterface::Args const&, tensorflow::Tensor const&, bool)#2}::operator()(tensorflow::Status const&, tensorflow::RendezvousInterface::Args const&, tensorflow::RendezvousInterface::Args const&, tensorflow::Tensor const&, bool)::{lambda(tensorflow::Status const&)#1}>::_M_invoke(std::_Any_data const&, tensorflow::Status const&) () from /home/qspace/upload/libtensorflow_serving.so
#22 0x00007f506186fa29 in tensorflow::GPUUtil::CopyCPUTensorToGPU(tensorflow::Tensor const*, tensorflow::DeviceContext const*, tensorflow::Device*, tensorflow::Tensor*, std::function<void (tensorflow::Status const&)>, bool)::{lambda()#2}::operator()() const () from /home/qspace/upload/libtensorflow_serving.so
#23 0x00007f506189c3e1 in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) () from /home/qspace/upload/libtensorflow_serving.so
#24 0x00007f50618990f3 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /home/qspace/upload/libtensorflow_serving.so
#25 0x0000000007f0a9df in std::execute_native_thread_routine (__p=0x7f4a83d29490) at ../../../../../gcc-7.5.0/libstdc++-v3/src/c++11/thread.cc:83
#26 0x00007f505a533dc5 in start_thread () from /usr/lib64/libpthread.so.0
#27 0x00007f505993f74d in clone () from /usr/lib64/libc.so.6
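
Taken together, the two stacks suggest that Thread 167 never wakes because the GPU worker (Thread 301) is stuck acquiring an absl::Mutex inside CUDABlas::DoBlasGemm (frame #8) while dispatching a MatMul. We have not found the root cause, but one defensive measure on our side would be to bound the wait: frame #5 of Thread 167 (DirectSession::WaitForNotification) takes its deadline from RunOptions::timeout_in_ms, so a hung Run() would return DEADLINE_EXCEEDED instead of parking the request thread forever. A minimal sketch, assuming the serving path forwards these RunOptions to the session as the stack above indicates it does:

#include "tensorflow/core/protobuf/config.pb.h"

tensorflow::RunOptions MakeBoundedRunOptions() {
  tensorflow::RunOptions opt;
  // DirectSession::WaitForNotification (frame #5 above) honors this deadline;
  // with the default of 0 the wait is effectively unbounded.
  opt.set_timeout_in_ms(60 * 1000);  // give up after 60 s instead of hanging
  return opt;
}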

GPU Info:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.03 Driver Version: 525.116.03 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:0B.0 Off | 0 |
| N/A 52C P0 36W / 70W | 4695MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

Please give me some advice!

Standalone code to reproduce the issue

std::unique_ptr<tensorflow::serving::TensorflowPredictor> predictor_;
predictor_->Predict(opt, core_.get(), predict_req, &predict_resp, run_metadata.get());
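
For context, a fuller sketch of how these two lines are wired up in our service (the setup below is illustrative; core_ and predictor_ are actually constructed elsewhere during startup, and the include paths follow the TF Serving 2.2 source tree):

#include <memory>
#include "tensorflow/core/protobuf/config.pb.h"
#include "tensorflow_serving/apis/predict.pb.h"
#include "tensorflow_serving/model_servers/server_core.h"
#include "tensorflow_serving/servables/tensorflow/predict_impl.h"

// Assumed to be initialized during service startup (construction not shown):
extern std::unique_ptr<tensorflow::serving::ServerCore> core_;
extern std::unique_ptr<tensorflow::serving::TensorflowPredictor> predictor_;

tensorflow::Status RunOnePredict() {
  tensorflow::RunOptions opt;
  tensorflow::serving::PredictRequest predict_req;  // model_spec + inputs filled in by the caller
  tensorflow::serving::PredictResponse predict_resp;
  auto run_metadata = std::make_unique<tensorflow::RunMetadata>();
  // The call that never returns; its signature matches frame #13 of the pstack.
  return predictor_->Predict(opt, core_.get(), predict_req, &predict_resp,
                             run_metadata.get());
}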

Relevant log output

No response

@google-ml-butler google-ml-butler bot added the type:bug Bug label Jun 12, 2023
@sushreebarsa sushreebarsa added comp:gpu GPU related issues TF 2.2 Issues related to TF 2.2 labels Jun 13, 2023
sushreebarsa (Contributor) commented:

@ivankxt TF v2.2 is an older version that is no longer actively supported. We recommend upgrading to the latest TF version and letting us know whether the issue still persists.
Thank you!

@sushreebarsa sushreebarsa added stat:awaiting response Status - Awaiting response from author type:build/install Build and install issues subtype:centos Centos Build/Installation issues and removed comp:gpu GPU related issues type:bug Bug labels Jun 14, 2023
github-actions bot commented:

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Jun 22, 2023
github-actions bot commented:

This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.

@sushreebarsa sushreebarsa removed stat:awaiting response Status - Awaiting response from author stale This label marks the issue/pr stale - to be closed automatically if no activity labels Jul 12, 2023
@sushreebarsa sushreebarsa reopened this Jul 12, 2023

sushreebarsa commented Jul 17, 2023

@ivankxt As per the documentation, could you please try cuDNN 7.4, which is compatible with the CUDA version you are using? We also recommend moving to the latest stable release, as TF v2.2 is not actively supported and newer versions have fewer such issues. Please follow the instructions for GPU support as well. Thank you!
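
Before and after changing versions, it may help to confirm which CUDA runtime and cuDNN the binary actually links. A small standalone check (a sketch; cudaRuntimeGetVersion and cudnnGetVersion are standard CUDA/cuDNN API calls, and the build flags depend on your installation):

#include <cstdio>
#include <cuda_runtime_api.h>
#include <cudnn.h>

int main() {
  int runtime_version = 0;
  cudaRuntimeGetVersion(&runtime_version);         // e.g. 10000 for CUDA 10.0
  std::printf("CUDA runtime: %d\n", runtime_version);
  std::printf("cuDNN: %zu\n", cudnnGetVersion());  // e.g. 7605 for cuDNN 7.6.5
  return 0;
}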

@sushreebarsa sushreebarsa added the stat:awaiting response Status - Awaiting response from author label Jul 17, 2023
github-actions bot commented:

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Jul 25, 2023

github-actions bot commented Aug 2, 2023

This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.

@github-actions github-actions bot closed this as completed Aug 2, 2023
