
we should pthread_atfork around heap profiler lock(s) (was: LLVM thread(s) hang after fork from the parent process) #1425

Open
kvignesh1420 opened this issue Aug 30, 2023 · 2 comments

Comments

@kvignesh1420

Setup: I am using gperftools 2.11 for heap profiling of tensorflow 2.11 training jobs on a RHEL 7.9 machine.

Observation: When tensorflow ops are being compiled, the main process creates an llvm thread group and uses it for parallel compilation of the ops. In my setup, I observed that the only child process created via fork hangs when tcmalloc + heap profiling is enabled.

The backtrace for the parent process is shown below:

#0  0x00007fa2ca2bba35 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
#1  0x00007fa2c9582aec in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib64/libstdc++.so.6
#2  0x00007fa2a7f2339b in llvm::ThreadPool::wait(llvm::ThreadPoolTaskGroup&) ()
   from /opt/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3  0x00007fa2a73a1e6c in mlir::detail::OpToOpPassAdaptor::runOnOperationAsyncImpl(bool) ()

The backtrace for the child process is shown below:

#0  0x00007f5f54e7ae29 in syscall () from /usr/lib64/libc.so.6
#1  0x00007f5f562b8cb0 in base::internal::SpinLockDelay (w=w@entry=0x7f5f5667adb0 <heap_lock>, value=2, loop=loop@entry=31771) at ./src/base/spinlock_linux-inl.h:86
#2  0x00007f5f562b8b67 in SpinLock::SlowLock (this=this@entry=0x7f5f5667adb0 <heap_lock>) at src/base/spinlock.cc:134
#3  0x00007f5f562b3f4a in SpinLock::Lock (this=0x7f5f5667adb0 <heap_lock>) at src/base/spinlock.h:71
#4  SpinLockHolder::SpinLockHolder (l=0x7f5f5667adb0 <heap_lock>, this=<synthetic pointer>) at src/base/spinlock.h:123
#5  RecordAlloc (skip_count=0, bytes=16, ptr=0x4a06b960) at src/heap-profiler.cc:319
#6  NewHook (ptr=0x4a06b960, size=16) at src/heap-profiler.cc:341
#7  0x00007f5f562aec02 in MallocHook::InvokeNewHookSlow (p=p@entry=0x4a06b960, s=s@entry=16) at src/malloc_hook.cc:314
#8  0x00007f5f562bafa4 in MallocHook::InvokeNewHook (s=16, p=0x4a06b960) at src/malloc_hook-inl.h:133
#9  tcmalloc::do_allocate_full<tcmalloc::cpp_throw_oom> (size=16) at src/tcmalloc.cc:1808
#10 tcmalloc::allocate_full_cpp_throw_oom (size=16) at src/tcmalloc.cc:1818
#11 0x00007f5ef29028b1 in arrow::util::(anonymous namespace)::AfterForkState::AfterFork() ()
   from /opt/site-packages/pyarrow/libarrow.so.900
#12 0x00007f5f54e47c4e in fork () from /usr/lib64/libc.so.6
#13 0x00007f5f54e70830 in __spawni () from /usr/lib64/libc.so.6
#14 0x00007f5f54e707b0 in posix_spawnp@@GLIBC_2.15 () from /usr/lib64/libc.so.6
#15 0x00007f5f3389bf6d in tsl::SubProcess::Start() ()
   from /opt/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#16 0x00007f5f33546975 in stream_executor::CompileGpuAsm(int, int, char const*, stream_executor::GpuAsmOpts) ()
   from /opt/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
@alk
Contributor

alk commented Aug 31, 2023

So this is an issue with the interaction between fork and pthread_atfork. I will spare you all the details and complexities of the atfork stuff, but here is what matters specifically in your case:

a) TF is calling the (only) right API, posix_spawn, to spawn the child process. But sadly, for some unknown reason, glibc until 2.24 had a "broken" implementation that forked (instead of the vfork, or rather clone-vfork, it should have used and which is used now). This lack of "properness" of posix_spawn in your older version of glibc is what triggers all the mess.
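For reference, here is a minimal standalone illustration of the posix_spawnp API mentioned above (this is not TF's actual SubProcess code, just the libc interface it ends up in; the spawned command is a placeholder). With a fixed posix_spawn implementation the child is created without a plain fork(), so no pthread_atfork handlers run:

#include <spawn.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

int main(void) {
    pid_t pid;
    /* argv for the child; "echo" is just a placeholder command */
    char *child_argv[] = { "echo", "hello from spawned child", NULL };

    /* posix_spawnp searches PATH; file actions and attributes are left NULL for brevity */
    int rc = posix_spawnp(&pid, "echo", NULL, NULL, child_argv, environ);
    if (rc != 0) {
        fprintf(stderr, "posix_spawnp failed: %d\n", rc);
        return 1;
    }
    waitpid(pid, NULL, 0);
    return 0;
}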

b) pthread_atfork is itself a super tricky thing in many cases. Google's internal policy, for example, is to never use it. There is an internal paper with the details, but somehow I am not able to find a public version. Some of the paper's arguments are imho not right, but the main point is sound: different libraries'/modules' lock and state nestings won't always match the nesting of atfork handlers established at runtime, causing deadlocks. For some reason that arrow thingy chose to employ the atfork stuff. Perhaps it is occasionally used in forked settings rather than threaded settings (I guess python only being able to offer parallelism via fork is what adds demand for this questionable feature). Then their atfork "after" handler calls into tcmalloc. You're likely using libtcmalloc LD_PRELOAD-ed (not much else makes sense). And we do some atfork business, but only for our main locks (yes, despite some arguments that maybe we shouldn't). So it would have worked, except we don't do the usual atfork dance for the heap profiler's heap_lock thingy. And this is where the child's "after" handler finds heap_lock arbitrarily "broken" and hangs.
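A minimal repro of that failure mode, independent of gperftools and arrow (a plain pthread_mutex stands in for the SpinLock heap_lock): another thread holds the lock while the main thread forks, so the child inherits the lock in its locked state with no owner left to release it:

#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   /* stands in for heap_lock */

static void *hold_lock(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    sleep(3);                       /* keep the lock held across the fork() below */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, hold_lock, NULL);
    sleep(1);                       /* make sure the other thread owns the lock first */

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: memory was copied with the lock marked "held", but the thread
         * that owns it does not exist here.  A blocking lock attempt would hang
         * forever (the heap_lock situation in the child backtrace above);
         * trylock just reports the stuck state and exits. */
        if (pthread_mutex_trylock(&lock) == EBUSY)
            printf("child: lock is stuck held; a blocking lock would hang\n");
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    pthread_join(t, NULL);
    return 0;
}

(Compile with -pthread.)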

So we could have our atfork handling amended to also do the locking around the heap profiler lock. Alternatively, you can avoid all the trouble by having a libc with the right posix_spawn implementation. I have checked that RHEL 8 does. If upgrading to RHEL 8 or later isn't an option, then consider "stealing" the right posix_spawn implementation from either modern glibc or from musl.
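For the first option, here is a hedged sketch of what that amended atfork handling amounts to (not the actual gperftools code; a plain pthread_mutex stands in for the heap profiler's SpinLock, and the function names are made up): register prepare/parent/child handlers so the lock is held across fork() and released again on both sides, leaving the child with a usable lock:

#include <pthread.h>

static pthread_mutex_t profiler_lock = PTHREAD_MUTEX_INITIALIZER;  /* stand-in for heap_lock */

static void profiler_lock_prepare(void) { pthread_mutex_lock(&profiler_lock); }
static void profiler_lock_parent(void)  { pthread_mutex_unlock(&profiler_lock); }
static void profiler_lock_child(void)   { pthread_mutex_unlock(&profiler_lock); }

/* Hypothetical: call once early, e.g. from the heap profiler's init path. */
void install_profiler_atfork_handlers(void) {
    pthread_atfork(profiler_lock_prepare,    /* runs before fork, in the parent */
                   profiler_lock_parent,     /* runs after fork, in the parent  */
                   profiler_lock_child);     /* runs after fork, in the child   */
}

This is the same dance tcmalloc already does for its main allocator locks; the suggestion here is simply to extend it to heap_lock.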

@kvignesh1420
Author

Thanks for the insight @alk

@alk alk changed the title from "LLVM thread(s) hang after fork from the parent process" to "we should pthread_atfork around heap profiler lock(s) (was: LLVM thread(s) hang after fork from the parent process)" Sep 1, 2023
@alk alk reopened this Sep 1, 2023
@alk alk added the enhancement label Sep 1, 2023