
Profiling hangs in cuda/cupti .so #62614

Closed
annaa-ka opened this issue Dec 11, 2023 · 6 comments
Labels
2.6.0 · comp:apis · stale · stat:awaiting response · type:bug

Comments


annaa-ka commented Dec 11, 2023

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

No

Source

binary

TensorFlow version

2.6.2

Custom code

Yes

OS platform and distribution

Ubuntu 20.04.6 LTS

Mobile device

No response

Python version

Python 3.8.10

Bazel version

bazel 5.2.0-1

GCC/compiler version

gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

CUDA/cuDNN version

CUDA Version: 12.0

GPU model and memory

Ubuntu 18.04.6 LTS

Current behavior?

Hi, I am using the TensorFlow profiler to profile the training of my model. Before the training starts, I get the following lines:

2023-12-10 16:31:15.256811: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
2023-12-10 16:31:15.257177: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1614] Profiler found 8 GPUs
2023-12-10 16:31:15.669753: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session tear down.
2023-12-10 16:31:15.670454: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1748] CUPTI activity buffer flushed
[2023-12-10 16:31:15,691 | trajectory_predictor.neural.models | 33727 | INFO] Tensorboard logs will be available in /tmp/tmp9vfwctuu_tb_logs

However, when the real training starts, it hangs.

2023-12-10 16:31:18.057702: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2023-12-10 16:31:18.058010: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.

I got the following backtrace:

#0  futex_abstimed_wait (private=0, abstime=0x0, clockid=0, expected=2, futex_word=<optimized out>) at ../sysdeps/nptl/futex-internal.h:284
#1  __pthread_rwlock_wrlock_full (abstime=0x0, clockid=0, rwlock=0x895a2a0) at pthread_rwlock_common.c:830
#2  __GI___pthread_rwlock_wrlock (rwlock=0x895a2a0) at pthread_rwlock_wrlock.c:27
#3  0x00007fdbd52fa258 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fdbd523fcc1 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#5  0x00007fd95a8bc01a in ?? () from /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcupti.so.11.1
#6  0x00007fd95a8ba35c in ?? () from /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcupti.so.11.1
#7  0x00007fd95a89ae62 in ?? () from /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcupti.so.11.1
#8  0x00007fd95a8979b2 in ?? () from /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcupti.so.11.1
#9  0x00007fd95a89891b in ?? () from /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcupti.so.11.1
#10 0x00007fd95a86aa86 in ?? () from /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcupti.so.11.1
#11 0x00007fd95a86acf8 in ?? () from /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcupti.so.11.1
#12 0x00007fd95a86be6c in ?? () from /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcupti.so.11.1
#13 0x00007fdbd5058b5b in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#14 0x00007fdbd52ff6a0 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#15 0x00007fdbd502c7a6 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#16 0x00007fdbd502e792 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#17 0x00007fdbd512f2ca in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#18 0x00007fdc6a1281cb in ?? () from /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudart.so.11.0
#19 0x00007fdc6a16b7e6 in cudaLaunchKernel () from /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudart.so.11.0
#20 0x00007fdc7f7b6987 in ?? () from /usr/lib/python3/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#21 0x00007fdc7f7b8715 in tensorflow::functor::FillFunctor<Eigen::GpuDevice, float>::operator()(Eigen::GpuDevice const&, Eigen::TensorMap<Eigen::Tensor<float, 1, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorMap<Eigen::TensorFixedSize<float const, Eigen::Sizes<>, 1, long>, 16, Eigen::MakePointer>) () from /usr/lib/python3/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

Update: I tried strace -f -p my_pid and found a lot of lines like

 poll([{fd=20, events=POLLIN}, {fd=22, events=POLLIN}, {fd=24, events=POLLIN}, {fd=26, events=POLLIN}, {fd=28, events=POLLIN}, {fd=29, events=POLLI

lsof -p showed that these fds refer to /dev/nvidia0.

It seems that libcuda.so.1 and libcupti.so.11.1 were built without debug symbols and are proprietary NVIDIA libraries, so is there any way to find out what is happening?

Standalone code to reproduce the issue

The main problem is that when I run profiling with a simple model from the Internet, everything works; but we have our own model, which I currently do not fully understand, and I want to use backtraces to figure out what happens there.
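For reference, a minimal sketch (placeholder data, model, and log directory; not the pastebin code) of the kind of simple profiling setup that does work, using the Keras TensorBoard callback with profile_batch:

```python
import numpy as np
import tensorflow as tf

# Placeholder data and model; the real model from this report is not shown.
x = np.random.rand(1024, 32).astype("float32")
y = np.random.rand(1024, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# profile_batch=(1, 2) traces batches 1-2, matching the original setting.
tb_cb = tf.keras.callbacks.TensorBoard(log_dir="/tmp/tb_logs", profile_batch=(1, 2))
model.fit(x, y, batch_size=64, epochs=1, callbacks=[tb_cb])
```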

Relevant log output

No response

@tilakrayal (Contributor) commented:

@annaa-ka,
TensorFlow 2.6.0 is compatible with Python 3.6-3.9, GCC 7.3.1, Bazel 3.7.2, cuDNN 8.1, and CUDA 11.2. In your case, most of the components are newer than the tested versions. Could you please follow the tested build configurations and reinstall TensorFlow for smooth execution?
https://www.tensorflow.org/install/source#gpu

Thank you!
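A minimal sketch, not from the original thread, of how one could confirm which CUDA and cuDNN versions the installed TensorFlow binary was actually built against (assuming tf.sysconfig.get_build_info() is available in that build):

```python
import tensorflow as tf

# Compare the versions TensorFlow was built against with what is installed on the host.
print("TF:", tf.__version__)
build = tf.sysconfig.get_build_info()
print("built with CUDA:", build.get("cuda_version"), "cuDNN:", build.get("cudnn_version"))
print("visible GPUs:", tf.config.list_physical_devices("GPU"))
```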

@tilakrayal added the stat:awaiting response label on Dec 12, 2023
@annaa-ka (Author) commented:


I do understand your concern about the versions, but there are two main issues.

Firstly, the simple example from here works fine (https://pastebin.com/r1Nz1qeS).

Secondly, if I try to profile my model starting from a later batch (earlier I used profile_batch=(1, 2), now changed to (2, 3)), I get a completely different trace (https://pastebin.com/dfsgcAKY).
So, I looked for similar issues and found #12667 (comment).

Could that still be relevant, and why do I get such different traces in the two cases?
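For comparison, a sketch (not from the thread) of the programmatic profiler API, which traces an explicit step range inside a custom training loop and can help check whether the hang is tied specifically to tracing the very first batch; the model, data, and log directory below are placeholders:

```python
import tensorflow as tf

# Placeholder model, optimizer, and data; only the profiler start/stop calls matter here.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam()
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([256, 8]), tf.random.normal([256, 1]))).batch(64)

for step, (x, y) in enumerate(dataset):
    if step == 2:
        tf.profiler.experimental.start("/tmp/tb_logs")  # begin tracing at batch 2
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(model(x) - y))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    if step == 3:
        tf.profiler.experimental.stop()  # stop tracing after batch 3
```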

@google-ml-butler bot removed the stat:awaiting response label on Dec 13, 2023
@tilakrayal added the comp:apis label on Dec 18, 2023
@sachinprasadhs added the stat:awaiting tensorflower label on Dec 27, 2023

rengoog commented Jan 11, 2024

Hi @annaa-ka,

Thanks for filing this.

Just to confirm whether this is a TensorBoard profiler issue: when does the stack trace get dumped? Does enabling the TensorBoard profiler throw a SIGABRT or SIGSEGV? Or is the hang followed by an external signal that kills the job?

@sachinprasadhs added the stat:awaiting response label and removed the stat:awaiting tensorflower label on Jan 11, 2024

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions bot added the stale label on Jan 19, 2024

This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.

