Get deadlock after Predict(cuda10.0, cudnn7.6.5, Tesla T4 GPU) #60841
Comments
@ivankxt TF v2.2 is an older version that is no longer actively supported. We recommend upgrading to the latest TF version and letting us know whether the issue still persists.
This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.
This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.
@ivankxt As per the documentation, could you please try cuDNN 7.4, which is compatible with the CUDA version you are using? We also recommend moving to the latest stable release, as TF v2.2 is not actively supported and newer versions have fewer such issues. Please follow the instructions for GPU support as well. Thank you!
Issue Type
Bug
Have you reproduced the bug with TF nightly?
Yes
Source
source
Tensorflow Version
tf2.2 + tfserving2.2
Custom Code
Yes
OS Platform and Distribution
centos7
Mobile device
No response
Python version
3.6
Bazel version
3.7.2
GCC/Compiler version
7.5
CUDA/cuDNN version
CUDA 10.0 / cuDNN 7.6.5
GPU model and memory
Tesla T4, 15 GB
Current Behaviour?
In our inference service, the predict interface (predictor_->Predict(...)) deadlocks when it executes.
std::unique_ptr<tensorflow::serving::TensorflowPredictor> predictor_;
predictor_->Predict(opt, core_.get(), predict_req, &predict_resp, run_metadata.get());
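As a caller-side guard (a sketch using a hypothetical helper, not a fix for the underlying deadlock), the blocking call can be run on another thread and waited on with a deadline, so the service at least returns an error instead of hanging. TensorFlow's `RunOptions` also exposes a `timeout_in_ms` field that bounds how long `DirectSession` waits, which is the built-in way to achieve the same effect:

```cpp
#include <chrono>
#include <future>
#include <memory>
#include <optional>
#include <thread>

// Hypothetical helper: run `fn` on another thread and wait up to
// `deadline_ms` for its result; returns std::nullopt on timeout.
// Note: on timeout the detached worker keeps running -- this bounds the
// caller's wait, it does not cancel the underlying computation.
template <typename Fn>
auto run_with_deadline(Fn fn, int deadline_ms)
    -> std::optional<decltype(fn())> {
  using R = decltype(fn());
  auto prom = std::make_shared<std::promise<R>>();
  auto fut = prom->get_future();
  std::thread([prom, fn] { prom->set_value(fn()); }).detach();
  if (fut.wait_for(std::chrono::milliseconds(deadline_ms)) ==
      std::future_status::timeout) {
    return std::nullopt;  // deadline hit: report an error instead of blocking
  }
  return fut.get();
}
```

With such a wrapper the service can log and alert on timeouts while the stuck request is investigated separately.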
Here is the pstack:
Obviously, it is in an asynchronous execution, waiting for something:
Thread 167 (Thread 0x7f4b1bfa7700 (LWP 81084)):
#0 0x00007f5059939c09 in syscall () from /usr/lib64/libc.so.6
#1 0x00007f505feb1bbb in nsync::nsync_mu_semaphore_p_with_deadline(nsync::nsync_semaphore_s_*, timespec) () from /home/qspace/upload/libtensorflow_serving.so
#2 0x00007f505feaedf9 in nsync::nsync_sem_wait_with_cancel_(nsync::waiter*, timespec, nsync::nsync_note_s_*) () from /home/qspace/upload/libtensorflow_serving.so
#3 0x00007f505feafeeb in nsync::nsync_cv_wait_with_deadline_generic(nsync::nsync_cv_s_*, void*, void (*)(void*), void (*)(void*), timespec, nsync::nsync_note_s_*) () from /home/qspace/upload/libtensorflow_serving.so
#4 0x00007f505feb03c3 in nsync::nsync_cv_wait_with_deadline(nsync::nsync_cv_s_*, nsync::nsync_mu_s_*, timespec, nsync::nsync_note_s_*) () from /home/qspace/upload/libtensorflow_serving.so
#5 0x00007f506168349c in tensorflow::DirectSession::WaitForNotification(tensorflow::Notification*, long long) () from /home/qspace/upload/libtensorflow_serving.so
#6 0x00007f50616834ed in tensorflow::DirectSession::WaitForNotification(tensorflow::Notification*, tensorflow::DirectSession::RunState*, tensorflow::CancellationManager*, long long) () from /home/qspace/upload/libtensorflow_serving.so
#7 0x00007f5061693bb5 in tensorflow::DirectSession::RunInternal(long long, tensorflow::RunOptions const&, tensorflow::CallFrameInterface*, tensorflow::DirectSession::ExecutorsAndKeys*, tensorflow::RunMetadata*, tensorflow::thread::ThreadPoolOptions const&) () from /home/qspace/upload/libtensorflow_serving.so
#8 0x00007f5061695dd5 in tensorflow::DirectSession::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*, tensorflow::thread::ThreadPoolOptions const&) () from /home/qspace/upload/libtensorflow_serving.so
#9 0x00007f5061681313 in tensorflow::DirectSession::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*) () from /home/qspace/upload/libtensorflow_serving.so
#10 0x00007f5067131cdc in tensorflow::serving::ServingSessionWrapper::Run(tensorflow::RunOptions const&, std::vector<std::pair<std::string, tensorflow::Tensor>, std::allocator<std::pair<std::string, tensorflow::Tensor> > > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<std::string, std::allocator<std::string> > const&, std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*, tensorflow::RunMetadata*) () from /home/qspace/upload/libtensorflow_serving.so
#11 0x00007f506714092b in tensorflow::serving::internal::RunPredict(tensorflow::RunOptions const&, tensorflow::MetaGraphDef const&, tensorflow::serving::optional const&, tensorflow::serving::internal::PredictResponseTensorSerializationOption, tensorflow::Session*, tensorflow::serving::PredictRequest const&, tensorflow::serving::PredictResponse*, tensorflow::RunMetadata*) () from /home/qspace/upload/libtensorflow_serving.so
#12 0x00007f5067131aa0 in tensorflow::serving::TensorflowPredictor::PredictWithModelSpec(tensorflow::RunOptions const&, tensorflow::serving::ServerCore*, tensorflow::serving::ModelSpec const&, tensorflow::serving::PredictRequest const&, tensorflow::serving::PredictResponse*, tensorflow::RunMetadata*) () from /home/qspace/upload/libtensorflow_serving.so
#13 0x00007f5067131c81 in tensorflow::serving::TensorflowPredictor::Predict(tensorflow::RunOptions const&, tensorflow::serving::ServerCore*, tensorflow::serving::PredictRequest const&, tensorflow::serving::PredictResponse*, tensorflow::RunMetadata*) () from /home/qspace/upload/libtensorflow_serving.so
#14 0x0000000002bd91b1 in mmfinderbd::RankModel::Predict (this=0x327bf080, req=..., resp=0x7f47042735d0) at bdegateway/mmfinder/mmfinderbdetfsvr/models/tf/rank_tf_model.cpp:525
#15 0x0000000002b838cc in mmfinderbd::ServerCoreSingleModel::Predict (this=0x2c176aa0 mmfinderbd::ServerCore::Instance()::instance, req=..., resp=...) at bdegateway/mmfinder/mmfinderbdetfsvr/core/server_core_single_model.cpp:365
#16 0x0000000002b5df8b in MMFinderBdeTfSvrServiceImpl_PB::InferImpl (this=0x7f4ad84cbec0, head_uin=, req=..., resp=0x7f47042735d0) at bdegateway/mmfinder/mmfinderbdetfsvr/mmfinderbdetfsvrserviceimpl_pb.cpp:93
#17 0x0000000002b731d8 in MMFinderBdeTfSvrDispatcher_PB::Infer (this=this@entry=0x7f4ad84cbe60, uin=, req_buffer=req_buffer@entry=0x7f4ad84cb868, resp_buffer=resp_buffer@entry=0x7f4ad84cb870) at bazel-out/cd7t-opt/genfiles/bdegateway/mmfinder/mmfinderbdetfsvr/skgenerated/sk_mmfinderbdetfsvrdispatcher.pb.cpp:1366
#18 0x0000000002b78569 in MMFinderBdeTfSvrDispatcher_PB::Dispatch (this=this@entry=0x7f4ad84cbe60) at bazel-out/cd7t-opt/genfiles/bdegateway/mmfinder/mmfinderbdetfsvr/skgenerated/sk_mmfinderbdetfsvrdispatcher.pb.cpp:398
#19 0x0000000002b5a91e in MMFinderBdeTfSvrServer::SKServerProc (this=, ctrl_info=0x7f470421b820, ctx=0x7f470421b7a0, in_pkg=0x7f42681054a0, out_pkg=0x7f42681054e0, args=) at ./bdegateway/mmfinder/mmfinderbdetfsvr/mmfinderbdetfsvrserver.h:44
#20 0x000000000671a6c8 in SMCoWorkerMt::CoWorkerIORun (this=0x330b4670, self=0x7f470421b6d0) at comm2/summer/smcoworker.cpp:1138
#21 0x0000000007cc679e in operator() (this=0x7f470421b908) at /home/mmdev/gcc7/lib/gcc/x86_64-pc-linux-gnu/7.5.0/../../../../include/c++/7.5.0/bits/std_function.h:706
#22 CoRoutineFunc (co=0x7f470421b8f0) at basic/colib/co_routine.cpp:601
#23 0x0000000000000000 in ?? ()
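The parked frames (`nsync_cv_wait_with_deadline` under `DirectSession::WaitForNotification`) are the standard condition-variable wait: the session thread sleeps until the executors signal completion, and with no timeout configured it can sleep forever. A minimal self-contained sketch of that pattern (plain `std::condition_variable`, not TF's nsync):

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Mimics the shape of DirectSession::WaitForNotification: block until
// Notify() is called, or give up after timeout_ms. A timeout of 0 means
// "wait forever", which is what turns a stuck executor into the hang
// observed in thread 167.
class Notification {
 public:
  void Notify() {
    std::lock_guard<std::mutex> l(mu_);
    done_ = true;
    cv_.notify_all();
  }
  // Returns true if notified, false if the deadline expired first.
  bool WaitFor(long long timeout_ms) {
    std::unique_lock<std::mutex> l(mu_);
    if (timeout_ms <= 0) {  // no deadline: potentially infinite wait
      cv_.wait(l, [&] { return done_; });
      return true;
    }
    return cv_.wait_for(l, std::chrono::milliseconds(timeout_ms),
                        [&] { return done_; });
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  bool done_ = false;
};
```

Nothing is wrong with this wait itself; the question is why the notification never arrives.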
And here is what exactly it is waiting for:
Thread 301 (Thread 0x7f4b09dcf700 (LWP 80869)):
#0 0x00007f5059939c09 in syscall () from /usr/lib64/libc.so.6
#1 0x0000000005bf5111 in WaitUntil (t=..., val=0, v=0x7f45f046b750) at mm3rd/abseil-cpp/absl/synchronization/internal/waiter.cc:107
#2 absl::lts_2020_02_25::synchronization_internal::Waiter::Wait (this=this@entry=0x7f45f046b750, t=t@entry=...) at mm3rd/abseil-cpp/absl/synchronization/internal/waiter.cc:151
#3 0x0000000005bf5052 in AbslInternalPerThreadSemWait (t=...) at mm3rd/abseil-cpp/absl/synchronization/internal/per_thread_sem.cc:93
#4 0x00007f5066f87b6d in absl::Mutex::Block(absl::base_internal::PerThreadSynch*) () from /home/qspace/upload/libtensorflow_serving.so
#5 0x00007f5066f8889e in absl::Mutex::LockSlowLoop(absl::SynchWaitParams*, int) () from /home/qspace/upload/libtensorflow_serving.so
#6 0x00007f5066f88dc2 in absl::Mutex::LockSlowWithDeadline(absl::MuHowS const*, absl::Condition const*, absl::synchronization_internal::KernelTimeout, int) () from /home/qspace/upload/libtensorflow_serving.so
#7 0x00007f505f0897ec in absl::Mutex::LockSlow(absl::MuHowS const*, absl::Condition const*, int) () from /home/qspace/upload/libtensorflow_serving.so
#8 0x00007f50601c56fe in stream_executor::gpu::CUDABlas::DoBlasGemm(stream_executor::Stream*, stream_executor::blas::Transpose, stream_executor::blas::Transpose, unsigned long long, unsigned long long, unsigned long long, float, stream_executor::DeviceMemory<float> const&, int, stream_executor::DeviceMemory<float> const&, int, float, stream_executor::DeviceMemory<float>*, int) () from /home/qspace/upload/libtensorflow_serving.so
#9 0x00007f5060290c13 in stream_executor::Stream::ThenBlasGemm(stream_executor::blas::Transpose, stream_executor::blas::Transpose, unsigned long long, unsigned long long, unsigned long long, float, stream_executor::DeviceMemory<float> const&, int, stream_executor::DeviceMemory<float> const&, int, float, stream_executor::DeviceMemory<float>*, int) () from /home/qspace/upload/libtensorflow_serving.so
#10 0x00007f5063aebc4f in tensorflow::LaunchMatMul<Eigen::GpuDevice, float, true>::launch(tensorflow::OpKernelContext*, tensorflow::Tensor const&, tensorflow::Tensor const&, Eigen::array<Eigen::IndexPair<long long>, 1ul> const&, std::vector<long long, std::allocator<long long> >*, bool, tensorflow::Tensor*) () from /home/qspace/upload/libtensorflow_serving.so
#11 0x00007f5063aec42d in tensorflow::MatMulOp<Eigen::GpuDevice, float, true>::Compute(tensorflow::OpKernelContext*) () from /home/qspace/upload/libtensorflow_serving.so
#12 0x00007f5061878296 in tensorflow::BaseGPUDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) () from /home/qspace/upload/libtensorflow_serving.so
#13 0x00007f50614a80bf in tensorflow::(anonymous namespace)::ExecutorState::Process(tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long) () from /home/qspace/upload/libtensorflow_serving.so
#14 0x00007f50614a8c7f in std::_Function_handler<void (), tensorflow::(anonymous namespace)::ExecutorState::ScheduleReady(absl::InlinedVector<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, 8ul, std::allocator<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode> >*, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*)::{lambda()#2}>::_M_invoke(std::_Any_data const&) () from /home/qspace/upload/libtensorflow_serving.so
#15 0x00007f506189a71f in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::ScheduleWithHint(std::function<void ()>, int, int) () from /home/qspace/upload/libtensorflow_serving.so
#16 0x00007f506189dd1b in tensorflow::thread::ThreadPool::Schedule(std::function<void ()>) () from /home/qspace/upload/libtensorflow_serving.so
#17 0x00007f5061681bb3 in std::_Function_handler<void (std::function<void ()>), tensorflow::DirectSession::RunInternal(long long, tensorflow::RunOptions const&, tensorflow::CallFrameInterface*, tensorflow::DirectSession::ExecutorsAndKeys*, tensorflow::RunMetadata*, tensorflow::thread::ThreadPoolOptions const&)::{lambda(tensorflow::DirectSession::PerPartitionExecutorsAndLib const&, tensorflow::Executor::Args*)#7}::operator()(tensorflow::DirectSession::PerPartitionExecutorsAndLib const&, tensorflow::Executor::Args*) const::{lambda(std::function<void ()>)#1}>::_M_invoke(std::_Any_data const&, std::function<void ()>&&) () from /home/qspace/upload/libtensorflow_serving.so
#18 0x00007f506149ac84 in tensorflow::(anonymous namespace)::ExecutorState::ScheduleReady(absl::InlinedVector<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, 8ul, std::allocator<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode> >*, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*) [clone .part.508] () from /home/qspace/upload/libtensorflow_serving.so
#19 0x00007f50614a3a84 in tensorflow::(anonymous namespace)::ExecutorState::NodeDone(tensorflow::Status const&, absl::InlinedVector<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, 8ul, std::allocator<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode> >*, tensorflow::NodeExecStatsInterface*, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*) () from /home/qspace/upload/libtensorflow_serving.so
#20 0x00007f50614a8e4f in tensorflow::(anonymous namespace)::ExecutorState::Process(tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long)::{lambda()#6}::operator()() const () from /home/qspace/upload/libtensorflow_serving.so
#21 0x00007f50614fcbd0 in std::_Function_handler<void (tensorflow::Status const&), tensorflow::(anonymous namespace)::IntraProcessRecvAsyncImpl(tensorflow::DeviceMgr const*, tensorflow::LocalRendezvous*, tensorflow::RendezvousInterface::ParsedKey const&, tensorflow::RendezvousInterface::Args const&, std::function<void (tensorflow::Status const&, tensorflow::RendezvousInterface::Args const&, tensorflow::RendezvousInterface::Args const&, tensorflow::Tensor const&, bool)>)::{lambda(tensorflow::Status const&, tensorflow::RendezvousInterface::Args const&, tensorflow::RendezvousInterface::Args const&, tensorflow::Tensor const&, bool)#2}::operator()(tensorflow::Status const&, tensorflow::RendezvousInterface::Args const&, tensorflow::RendezvousInterface::Args const&, tensorflow::Tensor const&, bool)::{lambda(tensorflow::Status const&)#1}>::_M_invoke(std::_Any_data const&, tensorflow::Status const&) () from /home/qspace/upload/libtensorflow_serving.so
#22 0x00007f506186fa29 in tensorflow::GPUUtil::CopyCPUTensorToGPU(tensorflow::Tensor const*, tensorflow::DeviceContext const*, tensorflow::Device*, tensorflow::Tensor*, std::function<void (tensorflow::Status const&)>, bool)::{lambda()#2}::operator()() const () from /home/qspace/upload/libtensorflow_serving.so
#23 0x00007f506189c3e1 in Eigen::ThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop(int) () from /home/qspace/upload/libtensorflow_serving.so
#24 0x00007f50618990f3 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /home/qspace/upload/libtensorflow_serving.so
#25 0x0000000007f0a9df in std::execute_native_thread_routine (__p=0x7f4a83d29490) at ../../../../../gcc-7.5.0/libstdc++-v3/src/c++11/thread.cc:83
#26 0x00007f505a533dc5 in start_thread () from /usr/lib64/libpthread.so.0
#27 0x00007f505993f74d in clone () from /usr/lib64/libc.so.6
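Put together, the two stacks form a wait dependency: the request thread (167) cannot proceed until the executor signals completion, and the executor thread (301) cannot proceed until it acquires an absl::Mutex inside `DoBlasGemm` that is apparently never released. A minimal self-contained illustration of that shape (hypothetical names; it models the dependency, not the actual cuBLAS lock):

```cpp
#include <chrono>
#include <mutex>
#include <thread>

// Models the observed dependency: the "executor" must take blas_mu before
// it can signal completion; if another owner never releases blas_mu, the
// waiter never hears back. try_lock_for stands in for the (unbounded)
// absl::Mutex::Lock seen in thread 301.
bool simulate_predict(std::timed_mutex& blas_mu) {
  bool notified = false;
  std::thread executor([&] {
    // Thread 301's position: blocked on the BLAS handle mutex.
    if (blas_mu.try_lock_for(std::chrono::milliseconds(50))) {
      notified = true;  // would fire the Notification here
      blas_mu.unlock();
    }
  });
  executor.join();  // thread 167's position: waiting for completion
  return notified;
}
```

Finding which thread holds that mutex (and why it never releases it) is the key question; a full `pstack` of every thread, looking for another frame inside cuBLAS or `absl::Mutex`, would show the owner.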
GPU Info:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.03    Driver Version: 525.116.03    CUDA Version: 12.0   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:0B.0 Off |                    0 |
| N/A   52C    P0    36W /  70W |   4695MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
Please give me some advice!
Standalone code to reproduce the issue
Relevant log output
No response