Deadlock in test_multiprocessing_pool_circular_import in free-threaded build #118332

Closed · colesbury opened this issue Apr 26, 2024 · 0 comments
Labels: topic-free-threading, type-bug

colesbury commented Apr 26, 2024

Bug report

I've observed an occasional deadlock involving the warnings mutex in test_multiprocessing_pool_circular_import. That test runs the following program, which involves both threads and multiprocessing.

```python
import multiprocessing
import os
import threading
import traceback


def t():
    try:
        with multiprocessing.Pool(1):
            pass
    except Exception:
        traceback.print_exc()
        os._exit(1)


def main():
    threads = []
    for i in range(20):
        threads.append(threading.Thread(target=t))
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


if __name__ == "__main__":
    main()
```

Backtraces

There is a deadlock between thread 2 and thread 5:

Thread 2 (stopping the world, waiting for the warnings mutex via critical section resume):

```
[Switching to thread 2 (Thread 0x7f334c1d7640 (LWP 3070381))]
...
#6  0x000055fe4af3ec9d in _PySemaphore_PlatformWait (sema=sema@entry=0x7f334c1d5070, timeout=timeout@entry=-1) at Python/parking_lot.c:137
#7  0x000055fe4af3edd3 in _PySemaphore_Wait (sema=sema@entry=0x7f334c1d5070, timeout=timeout@entry=-1, detach=detach@entry=1) at Python/parking_lot.c:206
#8  0x000055fe4af3ef5c in _PyParkingLot_Park (addr=addr@entry=0x55fe4b29a338 <_PyRuntime+123768>, expected=expected@entry=0x7f334c1d510f, size=size@entry=1, timeout_ns=timeout_ns@entry=-1, park_arg=park_arg@entry=0x7f334c1d5110, detach=detach@entry=1) at Python/parking_lot.c:309
#9  0x000055fe4af2bed9 in _PyMutex_LockTimed (m=m@entry=0x55fe4b29a338 <_PyRuntime+123768>, timeout=timeout@entry=-1, flags=flags@entry=_PY_LOCK_DETACH) at Python/lock.c:110
#10 0x000055fe4af2bfd9 in _PyMutex_LockSlow (m=m@entry=0x55fe4b29a338 <_PyRuntime+123768>) at Python/lock.c:53
#11 0x000055fe4aee8e1a in PyMutex_Lock (m=0x55fe4b29a338 <_PyRuntime+123768>) at ./Include/internal/pycore_lock.h:75
#12 _PyCriticalSection_Resume (tstate=tstate@entry=0x55fe4b8e82b0) at Python/critical_section.c:88
#13 0x000055fe4af543ac in _PyThreadState_Attach (tstate=tstate@entry=0x55fe4b8e82b0) at Python/pystate.c:2062
#14 0x000055fe4af07e84 in PyEval_AcquireThread (tstate=tstate@entry=0x55fe4b8e82b0) at Python/ceval_gil.c:568
#15 0x000055fe4af3ede2 in _PySemaphore_Wait (sema=sema@entry=0x7f334c1d5260, timeout=timeout@entry=1000000, detach=detach@entry=1) at Python/parking_lot.c:208
#16 0x000055fe4af3ef5c in _PyParkingLot_Park (addr=addr@entry=0x55fe4b29a36c <_PyRuntime+123820>, expected=expected@entry=0x7f334c1d52e7, size=size@entry=1, timeout_ns=1000000, park_arg=park_arg@entry=0x0, detach=detach@entry=1) at Python/parking_lot.c:309
#17 0x000055fe4af2c1ee in PyEvent_WaitTimed (evt=evt@entry=0x55fe4b29a36c <_PyRuntime+123820>, timeout_ns=timeout_ns@entry=1000000) at Python/lock.c:300
#18 0x000055fe4af54aa5 in stop_the_world (stw=stw@entry=0x55fe4b29a368 <_PyRuntime+123816>) at Python/pystate.c:2230
#19 0x000055fe4af54b66 in _PyEval_StopTheWorld (interp=interp@entry=0x55fe4b299280 <_PyRuntime+119488>) at Python/pystate.c:2292
#20 0x000055fe4aefec20 in gc_collect_internal (interp=0x55fe4b299280 <_PyRuntime+119488>, state=state@entry=0x7f334c1d5390) at Python/gc_free_threading.c:1079
#21 0x000055fe4aefeecb in gc_collect_main (tstate=tstate@entry=0x55fe4b8e82b0, generation=generation@entry=0, reason=reason@entry=_Py_GC_REASON_HEAP) at Python/gc_free_threading.c:1174
#22 0x000055fe4aeff189 in _Py_RunGC (tstate=tstate@entry=0x55fe4b8e82b0) at Python/gc_free_threading.c:1631
#23 0x000055fe4af08627 in _Py_HandlePending (tstate=tstate@entry=0x55fe4b8e82b0) at Python/ceval_gil.c:1113
...
```

Thread 5 (holds the warnings mutex via handoff, waiting for stop-the-world):

```
[Switching to thread 5 (Thread 0x7f334a9ca640 (LWP 3070385))]
...
#6  0x000055fe4af3ec9d in _PySemaphore_PlatformWait (sema=sema@entry=0x7f334a9c8b20, timeout=timeout@entry=-1) at Python/parking_lot.c:137
#7  0x000055fe4af3edd3 in _PySemaphore_Wait (sema=sema@entry=0x7f334a9c8b20, timeout=timeout@entry=-1, detach=detach@entry=0) at Python/parking_lot.c:206
#8  0x000055fe4af3ef5c in _PyParkingLot_Park (addr=addr@entry=0x55fe4b8f6fd8, expected=expected@entry=0x7f334a9c8ba4, size=size@entry=4, timeout_ns=timeout_ns@entry=-1, park_arg=park_arg@entry=0x0, detach=detach@entry=0) at Python/parking_lot.c:309
#9  0x000055fe4af5217f in tstate_wait_attach (tstate=tstate@entry=0x55fe4b8f6fb0) at Python/pystate.c:2024
#10 0x000055fe4af5439f in _PyThreadState_Attach (tstate=tstate@entry=0x55fe4b8f6fb0) at Python/pystate.c:2052
#11 0x000055fe4af07e84 in PyEval_AcquireThread (tstate=tstate@entry=0x55fe4b8f6fb0) at Python/ceval_gil.c:568
#12 0x000055fe4af3ede2 in _PySemaphore_Wait (sema=sema@entry=0x7f334a9c8c70, timeout=timeout@entry=-1, detach=detach@entry=1) at Python/parking_lot.c:208
#13 0x000055fe4af3ef5c in _PyParkingLot_Park (addr=addr@entry=0x55fe4b29a338 <_PyRuntime+123768>, expected=expected@entry=0x7f334a9c8d0f, size=size@entry=1, timeout_ns=timeout_ns@entry=-1, park_arg=park_arg@entry=0x7f334a9c8d10, detach=detach@entry=1) at Python/parking_lot.c:309
#14 0x000055fe4af2bed9 in _PyMutex_LockTimed (m=m@entry=0x55fe4b29a338 <_PyRuntime+123768>, timeout=timeout@entry=-1, flags=flags@entry=_PY_LOCK_DETACH) at Python/lock.c:110
#15 0x000055fe4af2bfd9 in _PyMutex_LockSlow (m=m@entry=0x55fe4b29a338 <_PyRuntime+123768>) at Python/lock.c:53
#16 0x000055fe4aee8c1e in _PyCriticalSection_BeginSlow (c=c@entry=0x7f334a9c8dd0, m=0x55fe4b29a338 <_PyRuntime+123768>) at Python/critical_section.c:17
#17 0x000055fe4ae48aa3 in _PyCriticalSection_Begin (m=<optimized out>, c=0x7f334a9c8dd0) at ./Include/internal/pycore_critical_section.h:201
#18 do_warn (message='This process (pid=3070187) is multi-threaded, use of fork() may lead to deadlocks in the child.', category=<type at remote 0x55fe4b2451e0>, stack_level=1, source=0x0, skip_file_prefixes=skip_file_prefixes@entry=0x0) at Python/_warnings.c:1002
#19 0x000055fe4ae48c00 in warn_unicode (category=<optimized out>, category@entry=<type at remote 0x55fe4b2451e0>, message=message@entry='This process (pid=3070187) is multi-threaded, use of fork() may lead to deadlocks in the child.', stack_level=stack_level@entry=1, source=source@entry=0x0) at Python/_warnings.c:1203
#20 0x000055fe4ae48cbe in _PyErr_WarnFormatV (source=source@entry=0x0, category=<type at remote 0x55fe4b2451e0>, stack_level=stack_level@entry=1, format=format@entry=0x55fe4b0df438 "This process (pid=%d) is multi-threaded, use of %s() may lead to deadlocks in the child.", vargs=vargs@entry=0x7f334a9c8e80) at Python/_warnings.c:1223
#21 0x000055fe4ae49904 in PyErr_WarnFormat (category=<optimized out>, stack_level=stack_level@entry=1, format=format@entry=0x55fe4b0df438 "This process (pid=%d) is multi-threaded, use of %s() may lead to deadlocks in the child.") at Python/_warnings.c:1236
#22 0x000055fe4af9ba89 in warn_about_fork_with_threads (name=name@entry=0x55fe4b0dd837 "fork") at ./Modules/posixmodule.c:7893
#23 0x000055fe4afa481c in os_fork_impl (module=<optimized out>) at ./Modules/posixmodule.c:7993
#24 0x000055fe4afa483d in os_fork (module=<optimized out>, _unused_ignored=<optimized out>) at ./Modules/clinic/posixmodule.c.h:4069
```

Explanation

There are two subtle things leading to the deadlock:

  1. Thread 2 is waiting for other threads to stop themselves using a PyEvent_WaitTimed call, which temporarily detaches and then re-attaches the thread state. In the process of re-attaching, the PyThreadState resumes the topmost critical section, which in this case means trying to re-acquire the warnings mutex. (That's a bit weird! See the sketch after this list.)
  2. Thread 5 holds the warnings mutex. Normally, threads suspend their critical sections when detaching, but in this case the lock was handed off directly from some other thread to thread 5.¹ Thread 5 hasn't processed the handoff yet, so it doesn't "know" that it holds the mutex.
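For context, here is a simplified sketch of the detach/re-attach path visible in thread 2's backtrace (frames #12–#18). This is illustrative, not the actual parking_lot.c source; the signatures and bodies are condensed:

```c
/* Simplified sketch of _PySemaphore_Wait() with detach=1 (illustrative,
 * not the actual CPython source). The key point: re-attaching the thread
 * state resumes the topmost critical section, which can block. */
int
_PySemaphore_Wait(_PySemaphore *sema, PyTime_t timeout, int detach)
{
    PyThreadState *tstate = NULL;
    if (detach) {
        tstate = _PyThreadState_GET();
        PyEval_ReleaseThread(tstate);  /* suspends active critical sections */
    }
    int res = _PySemaphore_PlatformWait(sema, timeout);
    if (detach) {
        /* PyEval_AcquireThread() -> _PyThreadState_Attach() ->
         * _PyCriticalSection_Resume(): re-acquires the warnings mutex.
         * In thread 2 this blocks on the mutex that was handed off to
         * thread 5, which itself can't attach during stop-the-world. */
        PyEval_AcquireThread(tstate);
    }
    return res;
}
```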

Possible fixes:

  1. Don't detach when calling PyEvent_WaitTimed() in the stop-the-world. Basically, add a variant of PyEvent_WaitTimed that accepts _Py_LOCK_DONT_DETACH so we never suspend/resume the critical section in the thread performing the stop-the-world. (Sketched below.)
  2. Insert a dummy critical section in stop_the_world (and remove it in start_the_world). The thread can still detach, but when it re-attaches it doesn't try to resume any critical sections (until it's done performing the stop-the-world).
  3. Avoid handing off locks directly. I'm pretty sure we want to keep supporting handoff, but I listed it here because I think it would also avoid the deadlock.

Both (1) and (2) seem reasonable to me. I'm not sure which is better.
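Here is a rough sketch of what option (1) could look like in stop_the_world(). This is hypothetical: the detach flag on PyEvent_WaitTimed, the wait_ns value, and the all_threads_parked() helper are placeholders inferred from the backtrace, not the actual patch:

```c
/* Hypothetical sketch of option (1): wait without detaching, so the
 * stopping thread never suspends/resumes its critical sections.
 * PyEvent_WaitTimed taking a detach flag is an assumed variant;
 * all_threads_parked() and wait_ns are made-up placeholders. */
static void
stop_the_world(struct _stoptheworld_state *stw)
{
    /* ... request that other threads park themselves ... */
    while (!all_threads_parked(stw)) {
        /* Previously this waited with detach, and re-attaching tried to
         * resume the topmost critical section (the warnings mutex). */
        PyEvent_WaitTimed(&stw->stop_event, wait_ns, /*detach=*/0);
    }
    /* ... */
}
```

The linked commits below ("Avoid detaching thread state when stopping the world") indicate that option (1) is roughly the direction the fix took.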

Linked PRs

Footnotes

  1. PyMutex implements stochastic fairness. Basically, after waiting for a while, locks are handed off directly to the next waiter in line without unlocking the mutex so that eventually everyone gets a turn.
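For illustration, the handoff described in the footnote might look roughly like this in an unlock path. This is a simplified sketch, not the actual Python/lock.c implementation; all names here are made up:

```c
/* Simplified sketch of direct lock handoff (illustrative only). Once a
 * waiter has waited past its fairness deadline, unlock transfers
 * ownership without ever clearing the locked bit, so no barging thread
 * can sneak in. */
void
mutex_unlock(struct mutex *m)
{
    struct waiter *w = first_waiter(m);
    if (w != NULL && now_ns() >= w->time_to_be_fair) {
        w->handed_off = 1;  /* mutex stays locked; w now owns it */
        wake(w);            /* w learns of the handoff only when it runs --
                               the window that bites thread 5 above */
    }
    else {
        m->locked = 0;      /* normal unlock: first acquirer wins */
        wake_all(m);
    }
}
```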

colesbury added a commit to colesbury/cpython that referenced this issue Apr 26, 2024
colesbury added a commit to colesbury/cpython that referenced this issue Apr 29, 2024
Avoid detaching thread state when stopping the world. When re-attaching
the thread state, the thread would attempt to resume the top-most
critical section, which might now be held by a thread paused for our
stop-the-world request.
colesbury added a commit that referenced this issue Apr 30, 2024
SonicField pushed a commit to SonicField/cpython that referenced this issue May 8, 2024