Profiling hangs in cuda/cupti .so #62614
Comments
@annaa-ka, Thank you!
I understand your concern about the versions, but there are two main issues. First, the simple example from here works fine (https://pastebin.com/r1Nz1qeS). Second, if I profile my model starting not from the first batch (earlier I used profile_batch=(1, 2), now changed to (2, 3)), I get a completely different trace (https://pastebin.com/dfsgcAKY). Can the version mismatch still be relevant, and why do I get such different traces in the two cases?
Hi @annaa-ka, thanks for filing this. Just to confirm whether it is a TensorBoard profiler issue: when does the stack trace get dumped? Does enabling the TensorBoard profiler throw a SIGABRT or SIGSEGV? Or is the hang followed by some external signal that kills the job?
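A way to answer this question empirically, without debug symbols, is the stdlib `faulthandler` module: once enabled, it prints a Python-level traceback of all threads if the process receives SIGSEGV, SIGABRT, SIGFPE, or SIGBUS (e.g. from a crash inside a native library like libcupti). This is a general debugging sketch, not something specific to the TensorFlow profiler:

```python
import faulthandler
import tempfile

# After this call, a fatal signal (SIGSEGV, SIGABRT, ...) will dump the
# Python traceback of every thread to stderr before the process dies.
faulthandler.enable()

# For a *hang* rather than a crash, faulthandler.dump_traceback_later(timeout)
# dumps all threads after `timeout` seconds. Here we do an immediate one-off
# dump to a temp file just to show what the output looks like.
with tempfile.TemporaryFile(mode="w+") as f:
    faulthandler.dump_traceback(file=f, all_threads=True)
    f.seek(0)
    text = f.read()

# The dump lists the source file and line of every frame in every thread.
print("File" in text)
```

If the dump shows all threads blocked inside a CUPTI call, that points at the native layer rather than the Python side.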
This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.
This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.
Issue type
Bug
Have you reproduced the bug with TensorFlow Nightly?
No
Source
binary
TensorFlow version
2.6.2
Custom code
Yes
OS platform and distribution
Ubuntu 20.04.6 LTS
Mobile device
No response
Python version
Python 3.8.10
Bazel version
bazel 5.2.0-1
GCC/compiler version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
CUDA/cuDNN version
CUDA Version: 12.0
GPU model and memory
Ubuntu 18.04.6 LTS
Current behavior?
Hi, I am using the TensorFlow profiler to profile the training of my model. Before training starts I get the following lines.
However, when the actual training starts, it hangs.
I got the following backtrace:
UPD: I tried running strace -f -p my_pid
and found a lot of lines like the following.
lsof -p showed that this fd refers to /dev/nvidia0.
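As a side note, the fd-to-path mapping that `lsof -p` reports can also be read directly from `/proc/<pid>/fd` on Linux. A minimal stdlib-only sketch (Linux-specific, illustrated with a temp file rather than a device node):

```python
import os
import tempfile

# On Linux, /proc/self/fd/<N> is a symlink to whatever file descriptor N
# refers to -- the same information `lsof -p <pid>` prints. For a hung
# process, reading /proc/<pid>/fd/<N> from another shell works the same way.
with tempfile.NamedTemporaryFile() as tmp:
    fd = tmp.fileno()
    target = os.readlink(f"/proc/self/fd/{fd}")
    # The symlink resolves to the path the descriptor is open on.
    print(os.path.realpath(target) == os.path.realpath(tmp.name))
```

For the case in this issue, the symlink for the polled descriptor would point at `/dev/nvidia0`.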
It seems that libcuda.so.1 and libcupti.so.11.1 were built without debug symbols and are NVIDIA proprietary, so is there any way to find out what is happening?
Standalone code to reproduce the issue
The main problem is that profiling works fine when I run it with a simple model from the Internet, but we have our own model, which I currently do not fully understand, and I want to use backtraces to figure out what happens there.
Relevant log output
No response