Replies: 3 comments 1 reply
-
Just a few quick thoughts,
In case of
Depends. If the code needs to modify both structs at the same time, then yes. If the code just stores a pointer (= a
TBH this sounds more like a workaround of a symptom than an actual problem/solution...
No surprise there. The big question is how difficult it is to find & fix these cases. I think at least it should be relatively easy to find (where the code unlocks just 1 lock out of 2).
Right.. if the two threads in the example used DAG lock order, they would still deadlock, yes ? OK perhaps i'm blind today but does DAG lock order bring some benefits over the existing lock order (object IDs) ? It seems to be a less generic solution, as it depends on both events being in DAG and able to determine the DAG order quickly. I'm not sure that would be possible in all cases. But if it's better in some significant way, we should consider it.
Right, for that purpose we already have macro Probably the main structural problem for your case is that a lot (most?) of PoCL code is written with the assumption that events will not fail to execute. And the OpenCL spec used to be a bit "underspecified" on what should happen in that case. |
Beta Was this translation helpful? Give feedback.
-
I had a look and it turns out it is only happening in the common.c broadcast function. However this function does call a number of other functions that in turn lock and unlock the wait events. Since the lock order with lock_events_in_order is only known during runtime, it is not easy to enforce the lock order in the nested functions. So we'd either need to have only one lock active at a time or make use of a DAG.
yes with a DAG that has l1->l2 this example would also deadlock since the lock order was not respected. The DAG makes it easier to see what the lock order is statically.
There basically is already one if you traverse the events in the notify list. So it'd be then first lock the notifier event, then the wait event.
Yes this would be pretty useful, it unfortunately is not used much (or maybe not at all) in the
Yes a lot of issues started coming out of the woodwork once a device is suddenly is not available. With normal devices this would be very uncommon, but with the addition of PoCL-R this is a real possibility. |
Beta Was this translation helpful? Give feedback.
-
I've made a couple of diagrams showing the locking behavior through different functions of just events. Fortunately, it is just broadcast and pocl_create_event_sync that make use of lock_events_inorder. I believe that if we adopt a policy of first locking the notifier event and then the wait event there, there should not be any direct event to event lock order violations. source files for obsidian canvas: |
Beta Was this translation helpful? Give feedback.
-
I've been trying to get pocl-(r) to work with an application that is more complex than the examples and I've been debugging the disconnect functionality. I've been able to fix a number of deadlocks so that it at least doesn't hang/crash the program (pr #1459), but there are still more before a program can continue to operate even if some devices are not available. There are a number of structural issues that I would like to start a discussion on.
1. The thread sanitizer throws a number of warnings
the remote driver creates a number of data race warnings, but that is a topic for another day. Instead there are a number of lock order inversion warnings between event objects and other objects that it has a reference of (e.g. memobj, command_queues, contexts). These primarily happen when an event is marked as failed while another thread is trying to create a command.
This leads to the question: "in what order should events and objects be locked?"
I'm personally leaning towards first event then object based on the release-event/command_queue/context code, but i'd love some feedback on this.
and similarly: "Do events and objects need to be locked at the same time?"
As an example, looking at the pocl_update_event_finished code, the pocl_free_event_memobjs function would not need to lock the event as well if a copy of the memobj list was first made, the original list freed and then unlock the event after which the memobjs were released.
2. The event.wait_list should not be used to interact with events in that list.
If events are chained correctly using the notifier lists, the results should be a directed acyclic graph (DAG). This is desirable because then there are not cycles and there is a clear order in which events should be locked. The wait_list allows code to traverse this graph in the opposite direction. This can lead to deadlocks when when an event further down the chain tries to lock and modify an event that is higher and already locked.
The primary use of the wait_list right now is to check if an event is waiting (by checking if its wait_list is NULL), but this can also be accomplished with a count that gets decremented every time a notifier event (an event higher up in the chain) notifies the waiting event that it is complete. The wait_list can be useful in debugging scenarios to see what events it is waiting on, but in general I think it is dangerous interact with the events in the wait_list.
3. Improper use of lock_in_order can still lead to deadlocks
lock_in_order is a function that will lock events based on their id number. It's purpose is to force a global lock order for all events and thereby preventing deadlocks. However it is still possible to get a deadlock. see the example below where l1 and l2 are mutexes and l1 has a lower id than l2.
lock(l1) in thread1 is a lock order violation since l2 was locked before lock(l1), so a solution could be to only use lock/unlock_in_order. The use of a DAG can help with defining the lock order, but can also turn into a deadlock if it is not respected. Some lock/unlock sequences could probably also be prevented by creating more versions of functions that do not lock by themselves but instead assume that the event is locked. For example, an internal clretainevent function that just increments the locked event reference count.
Beta Was this translation helpful? Give feedback.
All reactions