Get deadlock after Predict(cuda10.0, cudnn7.6.5, Tesla T4 GPU) #60841

Closed
ivankxt opened this issue Jun 12, 2023 · 8 comments

Comments

ivankxt commented Jun 12, 2023

Issue Type

Bug

Have you reproduced the bug with TF nightly?

Yes

Source

source

Tensorflow Version

TF 2.2 + TF Serving 2.2

Custom Code

Yes

OS Platform and Distribution

CentOS 7

Mobile device

No response

Python version

3.6

Bazel version

3.7.2

GCC/Compiler version

7.5

CUDA/cuDNN version

CUDA 10.0 / cuDNN 7.6.5

GPU model and memory

Tesla T4, 15 GB

Current Behaviour?

In our inference service, the call to the predict interface (predictor_->Predict(...)) deadlocks:

std::unique_ptr<tensorflow::serving::TensorflowPredictor> predictor_;
predictor_->Predict(opt, core_.get(), predict_req, &predict_resp, run_metadata.get());

Here is the pstack:

The thread is clearly inside the asynchronous execution path, waiting on a notification:

Thread 167 (Thread 0x7f4b1bfa7700 (LWP 81084)):
#0 0x00007f5059939c09 in syscall () from /usr/lib64/libc.so.6
#1 0x00007f505feb1bbb in nsync::nsync_mu_semaphore_p_with_deadline(nsync::nsync_semaphore_s_*, timespec) () from /home/qspace/upload/libtensorflow_serving.so
#2 0x00007f505feaedf9 in nsync::nsync_sem_wait_with_cancel_(nsync::waiter*, timespec, nsync::nsync_note_s_*) () from /home/qspace/upload/libtensorflow_serving.so
#3 0x00007f505feafeeb in nsync::nsync_cv_wait_with_deadline_generic(nsync::nsync_cv_s_*, void*, void (*)(void*), void (*)(void*), timespec, nsync::nsync_note_s_*) () from /home/qspace/upload/libtensorflow_serving.so
#4 0x00007f505feb03c3 in nsync::nsync_cv_wait_with_deadline(nsync::nsync_cv_s_*, nsync::nsync_mu_s_*, timespec, nsync::nsync_note_s_*) () from /home/qspace/upload/libtensorflow_serving.so
#5 0x00007f506168349c in tensorflow::DirectSession::WaitForNotification(tensorflow::Notification*, long long) () from /home/qspace/upload/libtensorflow_serving.so
#6 0x00007f50616834ed in tensorflow::DirectSession::WaitForNotification(tensorflow::Notification*, tensorflow::DirectSession::RunState*, tensorflow::CancellationManager*, long long) () from /home/qspace/upload/libtensorflow_serving.so
#7 0x00007f5061693bb5 in tensorflow::DirectSession::RunInternal(long long, tensorflow::RunOptions const&, tensorflow::CallFrameInterface*, tensorflow::DirectSession::ExecutorsAndKeys*, tensorflow::RunMetadata*, tensorflow::thread::ThreadPoolOptions const&) () from /home/qspace/upload/libtensorflow_serving.so
#8 0x00007f5061695dd5 in tensorflow::DirectSession::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*, tensorflow::thread::ThreadPoolOptions const&) () from /home/qspace/upload/libtensorflow_serving.so
#9 0x00007f5061681313 in tensorflow::DirectSession::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*) () from /home/qspace/upload/libtensorflow_serving.so
#10 0x00007f5067131cdc in tensorflow::serving::ServingSessionWrapper::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*) () from /home/qspace/upload/libtensorflow_serving.so
#11 0x00007f506714092b in tensorflow::serving::internal::RunPredict(tensorflow::RunOptions const&, tensorflow::MetaGraphDef const&, tensorflow::serving::optional const&, tensorflow::serving::internal::PredictResponseTensorSerializationOption, tensorflow::Session*, tensorflow::serving::PredictRequest const&, tensorflow::serving::PredictResponse*, tensorflow::RunMetadata*) () from /home/qspace/upload/libtensorflow_serving.so
#12 0x00007f5067131aa0 in tensorflow::serving::TensorflowPredictor::PredictWithModelSpec(tensorflow::RunOptions const&, tensorflow::serving::ServerCore*, tensorflow::serving::ModelSpec const&, tensorflow::serving::PredictRequest const&, tensorflow::serving::PredictResponse*, tensorflow::RunMetadata*) () from /home/qspace/upload/libtensorflow_serving.so
#13 0x00007f5067131c81 in tensorflow::serving::TensorflowPredictor::Predict(tensorflow::RunOptions const&, tensorflow::serving::ServerCore*, tensorflow::serving::PredictRequest const&, tensorflow::serving::PredictResponse*, tensorflow::RunMetadata*) () from /home/qspace/upload/libtensorflow_serving.so
#14 0x0000000002bd91b1 in mmfinderbd::RankModel::Predict (this=0x327bf080, req=..., resp=0x7f47042735d0) at bdegateway/mmfinder/mmfinderbdetfsvr/models/tf/rank_tf_model.cpp:525
#15 0x0000000002b838cc in mmfinderbd::ServerCoreSingleModel::Predict (this=0x2c176aa0 <mmfinderbd::ServerCore::Instance()::instance>, req=..., resp=...) at bdegateway/mmfinder/mmfinderbdetfsvr/core/server_core_single_model.cpp:365
#16 0x0000000002b5df8b in MMFinderBdeTfSvrServiceImpl_PB::InferImpl (this=0x7f4ad84cbec0, head_uin=<optimized out>, req=..., resp=0x7f47042735d0) at bdegateway/mmfinder/mmfinderbdetfsvr/mmfinderbdetfsvrserviceimpl_pb.cpp:93
#17 0x0000000002b731d8 in MMFinderBdeTfSvrDispatcher_PB::Infer (this=this@entry=0x7f4ad84cbe60, uin=<optimized out>, req_buffer=req_buffer@entry=0x7f4ad84cb868, resp_buffer=resp_buffer@entry=0x7f4ad84cb870) at bazel-out/cd7t-opt/genfiles/bdegateway/mmfinder/mmfinderbdetfsvr/skgenerated/sk_mmfinderbdetfsvrdispatcher.pb.cpp:1366
#18 0x0000000002b78569 in MMFinderBdeTfSvrDispatcher_PB::Dispatch (this=this@entry=0x7f4ad84cbe60) at bazel-out/cd7t-opt/genfiles/bdegateway/mmfinder/mmfinderbdetfsvr/skgenerated/sk_mmfinderbdetfsvrdispatcher.pb.cpp:398
#19 0x0000000002b5a91e in MMFinderBdeTfSvrServer::SKServerProc (this=<optimized out>, ctrl_info=0x7f470421b820, ctx=0x7f470421b7a0, in_pkg=0x7f42681054a0, out_pkg=0x7f42681054e0, args=<optimized out>) at ./bdegateway/mmfinder/mmfinderbdetfsvr/mmfinderbdetfsvrserver.h:44
#20 0x000000000671a6c8 in SMCoWorkerMt::CoWorkerIORun (this=0x330b4670, self=0x7f470421b6d0) at comm2/summer/smcoworker.cpp:1138
#21 0x0000000007cc679e in operator() (this=0x7f470421b908) at /home/mmdev/gcc7/lib/gcc/x86_64-pc-linux-gnu/7.5.0/../../../../include/c++/7.5.0/bits/std_function.h:706
#22 CoRoutineFunc (co=0x7f470421b8f0) at basic/colib/co_routine.cpp:601
#23 0x0000000000000000 in ?? ()

And here is what it is actually waiting for:

Thread 301 (Thread 0x7f4b09dcf700 (LWP 80869)):
#0 0x00007f5059939c09 in syscall () from /usr/lib64/libc.so.6
#1 0x0000000005bf5111 in WaitUntil (t=..., val=0, v=0x7f45f046b750) at mm3rd/abseil-cpp/absl/synchronization/internal/waiter.cc:107
#2 absl::lts_2020_02_25::synchronization_internal::Waiter::Wait (this=this@entry=0x7f45f046b750, t=t@entry=...) at mm3rd/abseil-cpp/absl/synchronization/internal/waiter.cc:151
#3 0x0000000005bf5052 in AbslInternalPerThreadSemWait (t=...) at mm3rd/abseil-cpp/absl/synchronization/internal/per_thread_sem.cc:93
#4 0x00007f5066f87b6d in absl::Mutex::Block(absl::base_internal::PerThreadSynch*) () from /home/qspace/upload/libtensorflow_serving.so
#5 0x00007f5066f8889e in absl::Mutex::LockSlowLoop(absl::SynchWaitParams*, int) () from /home/qspace/upload/libtensorflow_serving.so
#6 0x00007f5066f88dc2 in absl::Mutex::LockSlowWithDeadline(absl::MuHowS const*, absl::Condition const*, absl::synchronization_internal::KernelTimeout, int) () from /home/qspace/upload/libtensorflow_serving.so
#7 0x00007f505f0897ec in absl::Mutex::LockSlow(absl::MuHowS const*, absl::Condition const*, int) () from /home/qspace/upload/libtensorflow_serving.so
#8 0x00007f50601c56fe in stream_executor::gpu::CUDABlas::DoBlasGemm(stream_executor::Stream*, stream_executor::blas::Transpose, stream_executor::blas::Transpose, unsigned long long, unsigned long long, unsigned long long, float, stream_executor::DeviceMemory<float> const&, int, stream_executor::DeviceMemory<float> const&, int, float, stream_executor::DeviceMemory<float>*, int) () from /home/qspace/upload/libtensorflow_serving.so
#9 0x00007f5060290c13 in stream_executor::Stream::ThenBlasGemm(stream_executor::blas::Transpose, stream_executor::blas::Transpose, unsigned long long, unsigned long long, unsigned long long, float, stream_executor::DeviceMemory<float> const&, int, stream_executor::DeviceMemory<float> const&, int, float, stream_executor::DeviceMemory<float>*, int) () from /home/qspace/upload/libtensorflow_serving.so
#10 0x00007f5063aebc4f in tensorflow::LaunchMatMul<Eigen::GpuDevice, float, true>::launch(tensorflow::OpKernelContext*, tensorflow::Tensor const&, tensorflow::Tensor const&, Eigen::array<Eigen::IndexPair<Eigen::DenseIndex>, 1ul> const&, std::vector<long long, std::allocator<long long> >*, bool, tensorflow::Tensor*) () from /home/qspace/upload/libtensorflow_serving.so
#11 0x00007f5063aec42d in tensorflow::MatMulOp<Eigen::GpuDevice, float, true>::Compute(tensorflow::OpKernelContext*) () from /home/qspace/upload/libtensorflow_serving.so
#12 0x00007f5061878296 in tensorflow::BaseGPUDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) () from /home/qspace/upload/libtensorflow_serving.so
#13 0x00007f50614a80bf in tensorflow::(anonymous namespace)::ExecutorState::Process(tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long) () from /home/qspace/upload/libtensorflow_serving.so
#14 0x00007f50614a8c7f in std::_Function_handler<void (), tensorflow::(anonymous namespace)::ExecutorState::ScheduleReady(absl::InlinedVector<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, 8ul, std::allocator<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode> > const&, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*)::{lambda()#2}>::_M_invoke(std::_Any_data const&) () from /home/qspace/upload/libtensorflow_serving.so
#15 0x00007f506189a71f in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::ScheduleWithHint(std::function<void ()>, int, int) () from /home/qspace/upload/libtensorflow_serving.so
#16 0x00007f506189dd1b in tensorflow::thread::ThreadPool::Schedule(std::function<void ()>) () from /home/qspace/upload/libtensorflow_serving.so
#17 0x00007f5061681bb3 in std::_Function_handler<void (std::function<void ()>), tensorflow::DirectSession::RunInternal(long long, tensorflow::RunOptions const&, tensorflow::CallFrameInterface*, tensorflow::DirectSession::ExecutorsAndKeys*, tensorflow::RunMetadata*, tensorflow::thread::ThreadPoolOptions const&)::{lambda(tensorflow::DirectSession::PerPartitionExecutorsAndLib const&, tensorflow::Executor::Args*)#7}::operator()(tensorflow::DirectSession::PerPartitionExecutorsAndLib const&, tensorflow::Executor::Args*) const::{lambda(std::function<void ()>)#1}>::_M_invoke(std::_Any_data const&, std::function<void ()>&&) () from /home/qspace/upload/libtensorflow_serving.so
#18 0x00007f506149ac84 in tensorflow::(anonymous namespace)::ExecutorState::ScheduleReady(absl::InlinedVector<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, 8ul, std::allocator<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode> > const&, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*) [clone .part.508] () from /home/qspace/upload/libtensorflow_serving.so
#19 0x00007f50614a3a84 in tensorflow::(anonymous namespace)::ExecutorState::NodeDone(tensorflow::Status const&, absl::InlinedVector<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, 8ul, std::allocator<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode> > const&, tensorflow::NodeExecStatsInterface*, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*) () from /home/qspace/upload/libtensorflow_serving.so
#20 0x00007f50614a8e4f in tensorflow::(anonymous namespace)::ExecutorState::Process(tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long)::{lambda()#6}::operator()() const () from /home/qspace/upload/libtensorflow_serving.so
#21 0x00007f50614fcbd0 in std::_Function_handler<void (tensorflow::Status const&), tensorflow::(anonymous namespace)::IntraProcessRecvAsyncImpl(tensorflow::DeviceMgr const*, tensorflow::LocalRendezvous*, tensorflow::RendezvousInterface::ParsedKey const&, tensorflow::RendezvousInterface::Args const&, std::function<void (tensorflow::Status const&, tensorflow::RendezvousInterface::Args const&, tensorflow::RendezvousInterface::Args const&, tensorflow::Tensor const&, bool)>)::{lambda(tensorflow::Status const&, tensorflow::RendezvousInterface::Args const&, tensorflow::RendezvousInterface::Args const&, tensorflow::Tensor const&, bool)#2}::operator()(tensorflow::Status const&, tensorflow::RendezvousInterface::Args const&, tensorflow::RendezvousInterface::Args const&, tensorflow::Tensor const&, bool)::{lambda(tensorflow::Status const&)#1}>::_M_invoke(std::_Any_data const&, tensorflow::Status const&) () from /home/qspace/upload/libtensorflow_serving.so
#22 0x00007f506186fa29 in tensorflow::GPUUtil::CopyCPUTensorToGPU(tensorflow::Tensor const*, tensorflow::DeviceContext const*, tensorflow::Device*, tensorflow::Tensor*, std::function<void (tensorflow::Status const&)>, bool)::{lambda()#2}::operator()() const () from /home/qspace/upload/libtensorflow_serving.so
#23 0x00007f506189c3e1 in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) () from /home/qspace/upload/libtensorflow_serving.so
#24 0x00007f50618990f3 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /home/qspace/upload/libtensorflow_serving.so
#25 0x0000000007f0a9df in std::execute_native_thread_routine (__p=0x7f4a83d29490) at ../../../../../gcc-7.5.0/libstdc++-v3/src/c++11/thread.cc:83
#26 0x00007f505a533dc5 in start_thread () from /usr/lib64/libpthread.so.0
#27 0x00007f505993f74d in clone () from /usr/lib64/libc.so.6
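
Taken together, the two stacks suggest that Thread 167 never wakes because the GPU worker (Thread 301) is stuck acquiring an absl::Mutex inside CUDABlas::DoBlasGemm (frame #8) while dispatching a MatMul. We have not found the root cause, but one defensive measure on our side would be to bound the wait: frame #5 of Thread 167 (DirectSession::WaitForNotification) takes its deadline from RunOptions::timeout_in_ms, so a hung Run() would return DEADLINE_EXCEEDED instead of parking the request thread forever. A minimal sketch, assuming the serving path forwards these RunOptions to the session as the stack above indicates it does:

#include "tensorflow/core/protobuf/config.pb.h"

tensorflow::RunOptions MakeBoundedRunOptions() {
  tensorflow::RunOptions opt;
  // DirectSession::WaitForNotification (frame #5 above) honors this deadline;
  // with the default of 0 the wait is effectively unbounded.
  opt.set_timeout_in_ms(60 * 1000);  // give up after 60 s instead of hanging
  return opt;
}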

GPU Info:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.03 Driver Version: 525.116.03 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:0B.0 Off | 0 |
| N/A 52C P0 36W / 70W | 4695MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

Please give me some advice!

Standalone code to reproduce the issue

std::unique_ptr<tensorflow::serving::TensorflowPredictor> predictor_;
predictor_->Predict(opt, core_.get(), predict_req, &predict_resp, run_metadata.get());
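
For context, a fuller sketch of how these two lines are wired up in our service (the setup below is illustrative; core_ and predictor_ are actually constructed elsewhere during startup, and the include paths follow the TF Serving 2.2 source tree):

#include <memory>
#include "tensorflow/core/protobuf/config.pb.h"
#include "tensorflow_serving/apis/predict.pb.h"
#include "tensorflow_serving/model_servers/server_core.h"
#include "tensorflow_serving/servables/tensorflow/predict_impl.h"

// Assumed to be initialized during service startup (construction not shown):
extern std::unique_ptr<tensorflow::serving::ServerCore> core_;
extern std::unique_ptr<tensorflow::serving::TensorflowPredictor> predictor_;

tensorflow::Status RunOnePredict() {
  tensorflow::RunOptions opt;
  tensorflow::serving::PredictRequest predict_req;  // model_spec + inputs filled in by the caller
  tensorflow::serving::PredictResponse predict_resp;
  auto run_metadata = std::make_unique<tensorflow::RunMetadata>();
  // The call that never returns; its signature matches frame #13 of the pstack.
  return predictor_->Predict(opt, core_.get(), predict_req, &predict_resp,
                             run_metadata.get());
}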

Relevant log output

No response

@google-ml-butler google-ml-butler bot added the type:bug Bug label Jun 12, 2023
@sushreebarsa sushreebarsa added comp:gpu GPU related issues TF 2.2 Issues related to TF 2.2 labels Jun 13, 2023
sushreebarsa (Contributor) commented:

@ivankxt TF v2.2 is an older version that is no longer actively supported. We recommend upgrading to the latest TF version and letting us know whether the issue still persists.
Thank you!

@sushreebarsa sushreebarsa added stat:awaiting response Status - Awaiting response from author type:build/install Build and install issues subtype:centos Centos Build/Installation issues and removed comp:gpu GPU related issues type:bug Bug labels Jun 14, 2023
github-actions bot commented:

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Jun 22, 2023
github-actions bot commented:

This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.

@sushreebarsa sushreebarsa removed stat:awaiting response Status - Awaiting response from author stale This label marks the issue/pr stale - to be closed automatically if no activity labels Jul 12, 2023
@sushreebarsa sushreebarsa reopened this Jul 12, 2023

sushreebarsa commented Jul 17, 2023

@ivankxt As per the documentation, could you please try cuDNN 7.4, which is compatible with the CUDA version you are using? We also recommend moving to the latest stable release, as TF v2.2 is not actively supported and newer versions have fewer such issues. Please follow the instructions for GPU support as well. Thank you!
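
Before and after changing versions, it may help to confirm which CUDA runtime and cuDNN the binary actually links. A small standalone check (a sketch; cudaRuntimeGetVersion and cudnnGetVersion are standard CUDA/cuDNN API calls, and the build flags depend on your installation):

#include <cstdio>
#include <cuda_runtime_api.h>
#include <cudnn.h>

int main() {
  int runtime_version = 0;
  cudaRuntimeGetVersion(&runtime_version);         // e.g. 10000 for CUDA 10.0
  std::printf("CUDA runtime: %d\n", runtime_version);
  std::printf("cuDNN: %zu\n", cudnnGetVersion());  // e.g. 7605 for cuDNN 7.6.5
  return 0;
}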

@sushreebarsa sushreebarsa added the stat:awaiting response Status - Awaiting response from author label Jul 17, 2023
github-actions bot commented:

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Jul 25, 2023

github-actions bot commented Aug 2, 2023

This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.

@github-actions github-actions bot closed this as completed Aug 2, 2023
