
Profiling hangs in cuda/cupti .so #62614

Closed
annaa-ka opened this issue Dec 11, 2023 · 6 comments
Labels
2.6.0 · comp:apis · stale · stat:awaiting response · type:bug

Comments


annaa-ka commented Dec 11, 2023

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

No

Source

binary

TensorFlow version

2.6.2

Custom code

Yes

OS platform and distribution

Ubuntu 20.04.6 LTS

Mobile device

No response

Python version

Python 3.8.10

Bazel version

bazel 5.2.0-1

GCC/compiler version

gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

CUDA/cuDNN version

CUDA Version: 12.0

GPU model and memory

Ubuntu 18.04.6 LTS

Current behavior?

Hi, I am using the TensorFlow profiler to profile the training of my model. Before the training starts, I get the following lines:

2023-12-10 16:31:15.256811: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
2023-12-10 16:31:15.257177: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1614] Profiler found 8 GPUs
2023-12-10 16:31:15.669753: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session tear down.
2023-12-10 16:31:15.670454: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1748] CUPTI activity buffer flushed
[2023-12-10 16:31:15,691 | trajectory_predictor.neural.models | 33727 | INFO] Tensorboard logs will be available in /tmp/tmp9vfwctuu_tb_logs

However, when the real training starts, it hangs.

2023-12-10 16:31:18.057702: I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
2023-12-10 16:31:18.058010: I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.

I got the following backtrace:

#0  futex_abstimed_wait (private=0, abstime=0x0, clockid=0, expected=2, futex_word=<optimized out>) at ../sysdeps/nptl/futex-internal.h:284
#1  __pthread_rwlock_wrlock_full (abstime=0x0, clockid=0, rwlock=0x895a2a0) at pthread_rwlock_common.c:830
#2  __GI___pthread_rwlock_wrlock (rwlock=0x895a2a0) at pthread_rwlock_wrlock.c:27
#3  0x00007fdbd52fa258 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007fdbd523fcc1 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#5  0x00007fd95a8bc01a in ?? () from /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcupti.so.11.1
#6  0x00007fd95a8ba35c in ?? () from /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcupti.so.11.1
#7  0x00007fd95a89ae62 in ?? () from /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcupti.so.11.1
#8  0x00007fd95a8979b2 in ?? () from /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcupti.so.11.1
#9  0x00007fd95a89891b in ?? () from /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcupti.so.11.1
#10 0x00007fd95a86aa86 in ?? () from /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcupti.so.11.1
#11 0x00007fd95a86acf8 in ?? () from /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcupti.so.11.1
#12 0x00007fd95a86be6c in ?? () from /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcupti.so.11.1
#13 0x00007fdbd5058b5b in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#14 0x00007fdbd52ff6a0 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#15 0x00007fdbd502c7a6 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#16 0x00007fdbd502e792 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#17 0x00007fdbd512f2ca in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#18 0x00007fdc6a1281cb in ?? () from /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudart.so.11.0
#19 0x00007fdc6a16b7e6 in cudaLaunchKernel () from /usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudart.so.11.0
#20 0x00007fdc7f7b6987 in ?? () from /usr/lib/python3/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#21 0x00007fdc7f7b8715 in tensorflow::functor::FillFunctor<Eigen::GpuDevice, float>::operator()(Eigen::GpuDevice const&, Eigen::TensorMap<Eigen::Tensor<float, 1, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorMap<Eigen::TensorFixedSize<float const, Eigen::Sizes<>, 1, long>, 16, Eigen::MakePointer>) () from /usr/lib/python3/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so

Update: I tried strace -f -p my_pid and found a lot of lines like

 poll([{fd=20, events=POLLIN}, {fd=22, events=POLLIN}, {fd=24, events=POLLIN}, {fd=26, events=POLLIN}, {fd=28, events=POLLIN}, {fd=29, events=POLLI

lsof -p showed that these fds refer to /dev/nvidia0.

It seems that libcuda.so.1 and libcupti.so.11.1 were built without debug symbols and are proprietary NVIDIA libraries, so is there any way to find out what is happening?

Standalone code to reproduce the issue

The main problem is that when I run profiling with a simple model from the Internet, everything works; but we have our own model, which I currently do not fully understand, and I want to use backtraces to figure out what happens there.
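For reference, a minimal sketch (placeholder data, model, and log directory; not the pastebin code) of the kind of simple profiling setup that does work, using the Keras TensorBoard callback with profile_batch:

```python
import numpy as np
import tensorflow as tf

# Placeholder data and model; the real model from this report is not shown.
x = np.random.rand(1024, 32).astype("float32")
y = np.random.rand(1024, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# profile_batch=(1, 2) traces batches 1-2, matching the original setting.
tb_cb = tf.keras.callbacks.TensorBoard(log_dir="/tmp/tb_logs", profile_batch=(1, 2))
model.fit(x, y, batch_size=64, epochs=1, callbacks=[tb_cb])
```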

Relevant log output

No response

@tilakrayal (Contributor) commented:

@annaa-ka,
TensorFlow 2.6.0 is compatible with Python 3.6-3.9, GCC 7.3.1, Bazel 3.7.2, cuDNN 8.1, and CUDA 11.2. In your case, most of the components are newer than the tested versions. Could you please follow the tested build configurations and reinstall TensorFlow for smooth execution?
https://www.tensorflow.org/install/source#gpu

Thank you!
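A minimal sketch, not from the original thread, of how one could confirm which CUDA and cuDNN versions the installed TensorFlow binary was actually built against (assuming tf.sysconfig.get_build_info() is available in that build):

```python
import tensorflow as tf

# Compare the versions TensorFlow was built against with what is installed on the host.
print("TF:", tf.__version__)
build = tf.sysconfig.get_build_info()
print("built with CUDA:", build.get("cuda_version"), "cuDNN:", build.get("cudnn_version"))
print("visible GPUs:", tf.config.list_physical_devices("GPU"))
```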

@tilakrayal added the stat:awaiting response label on Dec 12, 2023
@annaa-ka (Author) commented:


I do understand your concern about the versions, but there are two main issues.

Firstly, the simple example from here works fine (https://pastebin.com/r1Nz1qeS).

Secondly, if I try to profile my model starting from a later batch (earlier I used profile_batch=(1, 2), now changed to (2, 3)), I get a completely different trace (https://pastebin.com/dfsgcAKY).
So, I looked for similar issues and found #12667 (comment).

Could that still be relevant, and why do I get such different traces in the two cases?
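For comparison, a sketch (not from the thread) of the programmatic profiler API, which traces an explicit step range inside a custom training loop and can help check whether the hang is tied specifically to tracing the very first batch; the model, data, and log directory below are placeholders:

```python
import tensorflow as tf

# Placeholder model, optimizer, and data; only the profiler start/stop calls matter here.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam()
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([256, 8]), tf.random.normal([256, 1]))).batch(64)

for step, (x, y) in enumerate(dataset):
    if step == 2:
        tf.profiler.experimental.start("/tmp/tb_logs")  # begin tracing at batch 2
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(model(x) - y))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    if step == 3:
        tf.profiler.experimental.stop()  # stop tracing after batch 3
```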

@google-ml-butler bot removed the stat:awaiting response label on Dec 13, 2023
@tilakrayal added the comp:apis label on Dec 18, 2023
@sachinprasadhs added the stat:awaiting tensorflower label on Dec 27, 2023

rengoog commented Jan 11, 2024

Hi @annaa-ka,

Thanks for filing this.

Just to confirm whether this is a TensorBoard profiler issue: when does the stack trace get dumped? Does enabling the TensorBoard profiler throw a SIGABRT or SIGSEGV? Or is the hang followed by an external signal that kills the job?

@sachinprasadhs added the stat:awaiting response label and removed the stat:awaiting tensorflower label on Jan 11, 2024

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions bot added the stale label on Jan 19, 2024

This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.

