Deadlock in test_multiprocessing_pool_circular_import in free-threaded build #118332

Closed · colesbury opened this issue Apr 26, 2024 · 0 comments
Labels: topic-free-threading, type-bug

colesbury commented Apr 26, 2024

Bug report

I've observed an occasional deadlock involving the warnings mutex in test_multiprocessing_pool_circular_import. That test runs the following program, which involves both threads and multiprocessing.

```python
import multiprocessing
import os
import threading
import traceback


def t():
    try:
        with multiprocessing.Pool(1):
            pass
    except Exception:
        traceback.print_exc()
        os._exit(1)


def main():
    threads = []
    for i in range(20):
        threads.append(threading.Thread(target=t))
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


if __name__ == "__main__":
    main()
```

Backtraces

There is a deadlock between thread 2 and thread 5:

Thread 2 (stopping the world, waiting for the warnings mutex via critical section resume):

```
[Switching to thread 2 (Thread 0x7f334c1d7640 (LWP 3070381))]
...
#6  0x000055fe4af3ec9d in _PySemaphore_PlatformWait (sema=sema@entry=0x7f334c1d5070, timeout=timeout@entry=-1) at Python/parking_lot.c:137
#7  0x000055fe4af3edd3 in _PySemaphore_Wait (sema=sema@entry=0x7f334c1d5070, timeout=timeout@entry=-1, detach=detach@entry=1) at Python/parking_lot.c:206
#8  0x000055fe4af3ef5c in _PyParkingLot_Park (addr=addr@entry=0x55fe4b29a338 <_PyRuntime+123768>, expected=expected@entry=0x7f334c1d510f, size=size@entry=1, timeout_ns=timeout_ns@entry=-1, park_arg=park_arg@entry=0x7f334c1d5110, detach=detach@entry=1) at Python/parking_lot.c:309
#9  0x000055fe4af2bed9 in _PyMutex_LockTimed (m=m@entry=0x55fe4b29a338 <_PyRuntime+123768>, timeout=timeout@entry=-1, flags=flags@entry=_PY_LOCK_DETACH) at Python/lock.c:110
#10 0x000055fe4af2bfd9 in _PyMutex_LockSlow (m=m@entry=0x55fe4b29a338 <_PyRuntime+123768>) at Python/lock.c:53
#11 0x000055fe4aee8e1a in PyMutex_Lock (m=0x55fe4b29a338 <_PyRuntime+123768>) at ./Include/internal/pycore_lock.h:75
#12 _PyCriticalSection_Resume (tstate=tstate@entry=0x55fe4b8e82b0) at Python/critical_section.c:88
#13 0x000055fe4af543ac in _PyThreadState_Attach (tstate=tstate@entry=0x55fe4b8e82b0) at Python/pystate.c:2062
#14 0x000055fe4af07e84 in PyEval_AcquireThread (tstate=tstate@entry=0x55fe4b8e82b0) at Python/ceval_gil.c:568
#15 0x000055fe4af3ede2 in _PySemaphore_Wait (sema=sema@entry=0x7f334c1d5260, timeout=timeout@entry=1000000, detach=detach@entry=1) at Python/parking_lot.c:208
#16 0x000055fe4af3ef5c in _PyParkingLot_Park (addr=addr@entry=0x55fe4b29a36c <_PyRuntime+123820>, expected=expected@entry=0x7f334c1d52e7, size=size@entry=1, timeout_ns=1000000, park_arg=park_arg@entry=0x0, detach=detach@entry=1) at Python/parking_lot.c:309
#17 0x000055fe4af2c1ee in PyEvent_WaitTimed (evt=evt@entry=0x55fe4b29a36c <_PyRuntime+123820>, timeout_ns=timeout_ns@entry=1000000) at Python/lock.c:300
#18 0x000055fe4af54aa5 in stop_the_world (stw=stw@entry=0x55fe4b29a368 <_PyRuntime+123816>) at Python/pystate.c:2230
#19 0x000055fe4af54b66 in _PyEval_StopTheWorld (interp=interp@entry=0x55fe4b299280 <_PyRuntime+119488>) at Python/pystate.c:2292
#20 0x000055fe4aefec20 in gc_collect_internal (interp=0x55fe4b299280 <_PyRuntime+119488>, state=state@entry=0x7f334c1d5390) at Python/gc_free_threading.c:1079
#21 0x000055fe4aefeecb in gc_collect_main (tstate=tstate@entry=0x55fe4b8e82b0, generation=generation@entry=0, reason=reason@entry=_Py_GC_REASON_HEAP) at Python/gc_free_threading.c:1174
#22 0x000055fe4aeff189 in _Py_RunGC (tstate=tstate@entry=0x55fe4b8e82b0) at Python/gc_free_threading.c:1631
#23 0x000055fe4af08627 in _Py_HandlePending (tstate=tstate@entry=0x55fe4b8e82b0) at Python/ceval_gil.c:1113
...
```

Thread 5 (holds the warnings mutex via handoff, waiting for stop-the-world):

```
[Switching to thread 5 (Thread 0x7f334a9ca640 (LWP 3070385))]
...
#6  0x000055fe4af3ec9d in _PySemaphore_PlatformWait (sema=sema@entry=0x7f334a9c8b20, timeout=timeout@entry=-1) at Python/parking_lot.c:137
#7  0x000055fe4af3edd3 in _PySemaphore_Wait (sema=sema@entry=0x7f334a9c8b20, timeout=timeout@entry=-1, detach=detach@entry=0) at Python/parking_lot.c:206
#8  0x000055fe4af3ef5c in _PyParkingLot_Park (addr=addr@entry=0x55fe4b8f6fd8, expected=expected@entry=0x7f334a9c8ba4, size=size@entry=4, timeout_ns=timeout_ns@entry=-1, park_arg=park_arg@entry=0x0, detach=detach@entry=0) at Python/parking_lot.c:309
#9  0x000055fe4af5217f in tstate_wait_attach (tstate=tstate@entry=0x55fe4b8f6fb0) at Python/pystate.c:2024
#10 0x000055fe4af5439f in _PyThreadState_Attach (tstate=tstate@entry=0x55fe4b8f6fb0) at Python/pystate.c:2052
#11 0x000055fe4af07e84 in PyEval_AcquireThread (tstate=tstate@entry=0x55fe4b8f6fb0) at Python/ceval_gil.c:568
#12 0x000055fe4af3ede2 in _PySemaphore_Wait (sema=sema@entry=0x7f334a9c8c70, timeout=timeout@entry=-1, detach=detach@entry=1) at Python/parking_lot.c:208
#13 0x000055fe4af3ef5c in _PyParkingLot_Park (addr=addr@entry=0x55fe4b29a338 <_PyRuntime+123768>, expected=expected@entry=0x7f334a9c8d0f, size=size@entry=1, timeout_ns=timeout_ns@entry=-1, park_arg=park_arg@entry=0x7f334a9c8d10, detach=detach@entry=1) at Python/parking_lot.c:309
#14 0x000055fe4af2bed9 in _PyMutex_LockTimed (m=m@entry=0x55fe4b29a338 <_PyRuntime+123768>, timeout=timeout@entry=-1, flags=flags@entry=_PY_LOCK_DETACH) at Python/lock.c:110
#15 0x000055fe4af2bfd9 in _PyMutex_LockSlow (m=m@entry=0x55fe4b29a338 <_PyRuntime+123768>) at Python/lock.c:53
#16 0x000055fe4aee8c1e in _PyCriticalSection_BeginSlow (c=c@entry=0x7f334a9c8dd0, m=0x55fe4b29a338 <_PyRuntime+123768>) at Python/critical_section.c:17
#17 0x000055fe4ae48aa3 in _PyCriticalSection_Begin (m=<optimized out>, c=0x7f334a9c8dd0) at ./Include/internal/pycore_critical_section.h:201
#18 do_warn (message='This process (pid=3070187) is multi-threaded, use of fork() may lead to deadlocks in the child.', category=<type at remote 0x55fe4b2451e0>, stack_level=1, source=0x0, skip_file_prefixes=skip_file_prefixes@entry=0x0) at Python/_warnings.c:1002
#19 0x000055fe4ae48c00 in warn_unicode (category=<optimized out>, category@entry=<type at remote 0x55fe4b2451e0>, message=message@entry='This process (pid=3070187) is multi-threaded, use of fork() may lead to deadlocks in the child.', stack_level=stack_level@entry=1, source=source@entry=0x0) at Python/_warnings.c:1203
#20 0x000055fe4ae48cbe in _PyErr_WarnFormatV (source=source@entry=0x0, category=<type at remote 0x55fe4b2451e0>, stack_level=stack_level@entry=1, format=format@entry=0x55fe4b0df438 "This process (pid=%d) is multi-threaded, use of %s() may lead to deadlocks in the child.", vargs=vargs@entry=0x7f334a9c8e80) at Python/_warnings.c:1223
#21 0x000055fe4ae49904 in PyErr_WarnFormat (category=<optimized out>, stack_level=stack_level@entry=1, format=format@entry=0x55fe4b0df438 "This process (pid=%d) is multi-threaded, use of %s() may lead to deadlocks in the child.") at Python/_warnings.c:1236
#22 0x000055fe4af9ba89 in warn_about_fork_with_threads (name=name@entry=0x55fe4b0dd837 "fork") at ./Modules/posixmodule.c:7893
#23 0x000055fe4afa481c in os_fork_impl (module=<optimized out>) at ./Modules/posixmodule.c:7993
#24 0x000055fe4afa483d in os_fork (module=<optimized out>, _unused_ignored=<optimized out>) at ./Modules/clinic/posixmodule.c.h:4069
```

Explanation

There are two subtle things leading to the deadlock:

  1. Thread 2 is waiting for other threads to stop themselves using a PyEvent_WaitTimed call, which temporarily detaches and then re-attaches the thread state. In the process of re-attaching, the PyThreadState resumes the topmost critical section, which in this case means trying to re-acquire the warnings mutex. (That's a bit weird! See the sketch after this list.)
  2. Thread 5 holds the warnings mutex. Normally, threads suspend their critical sections when detaching, but in this case the lock was handed off directly from some other thread to thread 5.¹ Thread 5 hasn't processed the handoff yet, so it doesn't "know" that it holds the mutex.
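For context, here is a simplified sketch of the detach/re-attach path visible in thread 2's backtrace (frames #12–#18). This is illustrative, not the actual parking_lot.c source; the signatures and bodies are condensed:

```c
/* Simplified sketch of _PySemaphore_Wait() with detach=1 (illustrative,
 * not the actual CPython source). The key point: re-attaching the thread
 * state resumes the topmost critical section, which can block. */
int
_PySemaphore_Wait(_PySemaphore *sema, PyTime_t timeout, int detach)
{
    PyThreadState *tstate = NULL;
    if (detach) {
        tstate = _PyThreadState_GET();
        PyEval_ReleaseThread(tstate);  /* suspends active critical sections */
    }
    int res = _PySemaphore_PlatformWait(sema, timeout);
    if (detach) {
        /* PyEval_AcquireThread() -> _PyThreadState_Attach() ->
         * _PyCriticalSection_Resume(): re-acquires the warnings mutex.
         * In thread 2 this blocks on the mutex that was handed off to
         * thread 5, which itself can't attach during stop-the-world. */
        PyEval_AcquireThread(tstate);
    }
    return res;
}
```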

Possible fixes:

  1. Don't detach when calling PyEvent_WaitTimed() in the stop-the-world. Basically, add a variant of PyEvent_WaitTimed that accepts _Py_LOCK_DONT_DETACH so we never suspend/resume the critical section in the thread performing the stop-the-world. (Sketched below.)
  2. Insert a dummy critical section in stop_the_world (and remove it in start_the_world). The thread can still detach, but when it re-attaches it doesn't try to resume any critical sections (until it's done performing the stop-the-world).
  3. Avoid handing off locks directly. I'm pretty sure we want to keep supporting handoff, but I listed it here because I think it would also avoid the deadlock.

Both (1) and (2) seem reasonable to me. I'm not sure which is better.
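Here is a rough sketch of what option (1) could look like in stop_the_world(). This is hypothetical: the detach flag on PyEvent_WaitTimed, the wait_ns value, and the all_threads_parked() helper are placeholders inferred from the backtrace, not the actual patch:

```c
/* Hypothetical sketch of option (1): wait without detaching, so the
 * stopping thread never suspends/resumes its critical sections.
 * PyEvent_WaitTimed taking a detach flag is an assumed variant;
 * all_threads_parked() and wait_ns are made-up placeholders. */
static void
stop_the_world(struct _stoptheworld_state *stw)
{
    /* ... request that other threads park themselves ... */
    while (!all_threads_parked(stw)) {
        /* Previously this waited with detach, and re-attaching tried to
         * resume the topmost critical section (the warnings mutex). */
        PyEvent_WaitTimed(&stw->stop_event, wait_ns, /*detach=*/0);
    }
    /* ... */
}
```

The linked commits below ("Avoid detaching thread state when stopping the world") indicate that option (1) is roughly the direction the fix took.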

Linked PRs

Footnotes

  1. PyMutex implements stochastic fairness. Basically, after waiting for a while, locks are handed off directly to the next waiter in line without unlocking the mutex so that eventually everyone gets a turn.
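For illustration, the handoff described in the footnote might look roughly like this in an unlock path. This is a simplified sketch, not the actual Python/lock.c implementation; all names here are made up:

```c
/* Simplified sketch of direct lock handoff (illustrative only). Once a
 * waiter has waited past its fairness deadline, unlock transfers
 * ownership without ever clearing the locked bit, so no barging thread
 * can sneak in. */
void
mutex_unlock(struct mutex *m)
{
    struct waiter *w = first_waiter(m);
    if (w != NULL && now_ns() >= w->time_to_be_fair) {
        w->handed_off = 1;  /* mutex stays locked; w now owns it */
        wake(w);            /* w learns of the handoff only when it runs --
                               the window that bites thread 5 above */
    }
    else {
        m->locked = 0;      /* normal unlock: first acquirer wins */
        wake_all(m);
    }
}
```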

colesbury added a commit to colesbury/cpython that referenced this issue Apr 26, 2024
colesbury added a commit to colesbury/cpython that referenced this issue Apr 29, 2024
Avoid detaching thread state when stopping the world. When re-attaching
the thread state, the thread would attempt to resume the top-most
critical section, which might now be held by a thread paused for our
stop-the-world request.
colesbury added a commit that referenced this issue Apr 30, 2024
SonicField pushed a commit to SonicField/cpython that referenced this issue May 8, 2024