Discussion on how should locks be applied in PoCL #1463

RABijl · 2024-05-10T08:08:35Z

RABijl
May 10, 2024

I've been trying to get pocl-(r) to work with an application that is more complex than the examples and I've been debugging the disconnect functionality. I've been able to fix a number of deadlocks so that it at least doesn't hang/crash the program (pr #1459), but there are still more before a program can continue to operate even if some devices are not available. There are a number of structural issues that I would like to start a discussion on.

1. The thread sanitizer throws a number of warnings

the remote driver creates a number of data race warnings, but that is a topic for another day. Instead there are a number of lock order inversion warnings between event objects and other objects that it has a reference of (e.g. memobj, command_queues, contexts). These primarily happen when an event is marked as failed while another thread is trying to create a command.

This leads to the question: "in what order should events and objects be locked?"
I'm personally leaning towards first event then object based on the release-event/command_queue/context code, but i'd love some feedback on this.
and similarly: "Do events and objects need to be locked at the same time?"
As an example, looking at the pocl_update_event_finished code, the pocl_free_event_memobjs function would not need to lock the event as well if a copy of the memobj list was first made, the original list freed and then unlock the event after which the memobjs were released.

2. The event.wait_list should not be used to interact with events in that list.

If events are chained correctly using the notifier lists, the results should be a directed acyclic graph (DAG). This is desirable because then there are not cycles and there is a clear order in which events should be locked. The wait_list allows code to traverse this graph in the opposite direction. This can lead to deadlocks when when an event further down the chain tries to lock and modify an event that is higher and already locked.
The primary use of the wait_list right now is to check if an event is waiting (by checking if its wait_list is NULL), but this can also be accomplished with a count that gets decremented every time a notifier event (an event higher up in the chain) notifies the waiting event that it is complete. The wait_list can be useful in debugging scenarios to see what events it is waiting on, but in general I think it is dangerous interact with the events in the wait_list.

3. Improper use of lock_in_order can still lead to deadlocks

lock_in_order is a function that will lock events based on their id number. It's purpose is to force a global lock order for all events and thereby preventing deadlocks. However it is still possible to get a deadlock. see the example below where l1 and l2 are mutexes and l1 has a lower id than l2.

thread 1	thread 2	comment
lock_in_order(l1,l2)		gets both locks
	lock_in_order(l1,l2)	waits for l1 to be unlocked
unlock(l1)		done in the code in order to call a function that needs l1 to be unlocked
	lock(l1)	happens in lock_in_order
	lock(l2)	lock_in_order blocks trying to lock l2
lock(l1)		deadlock !

lock(l1) in thread1 is a lock order violation since l2 was locked before lock(l1), so a solution could be to only use lock/unlock_in_order. The use of a DAG can help with defining the lock order, but can also turn into a deadlock if it is not respected. Some lock/unlock sequences could probably also be prevented by creating more versions of functions that do not lock by themselves but instead assume that the event is locked. For example, an internal clretainevent function that just increments the locked event reference count.

franz · 2024-05-10T14:24:24Z

franz
May 10, 2024
Maintainer

Just a few quick thoughts,

"in what order should events and objects be locked?"

In case of cl_mem, IIRC it should be event then cl_mem but i'm not 100% sure. In case of queue and an event, the locking order was queue then event. With other cl_objects there wasn't any order defined.

"Do events and objects need to be locked at the same time?"

Depends. If the code needs to modify both structs at the same time, then yes. If the code just stores a pointer (= a cl_object) into another struct cl_object then locking the latter might be enough - in many cases refcounting ensures we don't end up with dangling pointer.

The event.wait_list should not be used to interact with events in that list
in general I think it is dangerous interact with the events in the wait_list.

TBH this sounds more like a workaround of a symptom than an actual problem/solution...

Improper use of lock_in_order can still lead to deadlocks

No surprise there. The big question is how difficult it is to find & fix these cases. I think at least it should be relatively easy to find (where the code unlocks just 1 lock out of 2).

The use of a DAG can help with defining the lock order, but can also turn into a deadlock if it is not respected

Right.. if the two threads in the example used DAG lock order, they would still deadlock, yes ? OK perhaps i'm blind today but does DAG lock order bring some benefits over the existing lock order (object IDs) ? It seems to be a less generic solution, as it depends on both events being in DAG and able to determine the DAG order quickly. I'm not sure that would be possible in all cases. But if it's better in some significant way, we should consider it.

For example, an internal clretainevent function that just increments the locked event reference count.

Right, for that purpose we already have macro POCL_RETAIN_OBJECT_UNLOCKED. But this is a relatively simple case, there are more problematic cases.

Probably the main structural problem for your case is that a lot (most?) of PoCL code is written with the assumption that events will not fail to execute. And the OpenCL spec used to be a bit "underspecified" on what should happen in that case.

0 replies

RABijl · 2024-05-10T18:43:31Z

RABijl
May 10, 2024
Author

No surprise there. The big question is how difficult it is to find & fix these cases. I think at least it should be relatively easy to find (where the code unlocks just 1 lock out of 2).

I had a look and it turns out it is only happening in the common.c broadcast function. However this function does call a number of other functions that in turn lock and unlock the wait events. Since the lock order with lock_events_in_order is only known during runtime, it is not easy to enforce the lock order in the nested functions. So we'd either need to have only one lock active at a time or make use of a DAG.

Right.. if the two threads in the example used DAG lock order, they would still deadlock, yes ?

yes with a DAG that has l1->l2 this example would also deadlock since the lock order was not respected. The DAG makes it easier to see what the lock order is statically.

does DAG lock order bring some benefits over the existing lock order (object IDs) ?

It makes easier to determine the lock statically.
in the case of broadcast:
- it would allow the notifier event to be locked while the wait event is locked and unlocked multiple times
- in the rare case that the wait event id is smaller than the notifier id, you could still lock the notifier event and then the wait event.
DAGs could also be more easy to reason about more theoretically.

It seems to be a less generic solution, as it depends on both events being in DAG and able to determine the DAG order quickly.

There basically is already one if you traverse the events in the notify list. So it'd be then first lock the notifier event, then the wait event.

Right, for that purpose we already have macro POCL_RETAIN_OBJECT_UNLOCKED. But this is a relatively simple case, there are more problematic cases.

Yes this would be pretty useful, it unfortunately is not used much (or maybe not at all) in the pocl_update_event_finished chain of functions. That reminds me, imo it should be used on each event added to notifier list in pocl_create_event_sync and removed in the broadcast.

Probably the main structural problem for your case is that a lot (most?) of PoCL code is written with the assumption that events will not fail to execute.

Yes a lot of issues started coming out of the woodwork once a device is suddenly is not available. With normal devices this would be very uncommon, but with the addition of PoCL-R this is a real possibility.

0 replies

RABijl · 2024-05-16T05:15:11Z

RABijl
May 16, 2024
Author

I've made a couple of diagrams showing the locking behavior through different functions of just events. Fortunately, it is just broadcast and pocl_create_event_sync that make use of lock_events_inorder. I believe that if we adopt a policy of first locking the notifier event and then the wait event there, there should not be any direct event to event lock order violations.

source files for obsidian canvas:
Archive.zip

1 reply

pjaaskel May 17, 2024
Maintainer

I get a strong feeling of premature optimization here and we could spend the time for something more productive, perhaps. We used to care more about ultra fast event handling while working on Ville's master's thesis 10 years ago, but nowadays we have command buffers for optimizing task graphs as a whole.

I suggest we 1) coarsen the event locks to a single runtime-wide Global Event Lock (GEL) 2) make it recursive so the same thread can lock multiple times (to avoid lockups in calls where also the callee wants the lock) 3) check if the perf. impact is something we care about and then look at making critical sections shorter (while still using the GEL instead of per event locks).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Discussion on how should locks be applied in PoCL #1463

{{title}}

Replies: 3 comments 1 reply

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Discussion on how should locks be applied in PoCL #1463

RABijl May 10, 2024

1. The thread sanitizer throws a number of warnings

2. The event.wait_list should not be used to interact with events in that list.

3. Improper use of lock_in_order can still lead to deadlocks

Replies: 3 comments · 1 reply

franz May 10, 2024 Maintainer

RABijl May 10, 2024 Author

RABijl May 16, 2024 Author

pjaaskel May 17, 2024 Maintainer

RABijl
May 10, 2024

Replies: 3 comments 1 reply

franz
May 10, 2024
Maintainer

RABijl
May 10, 2024
Author

RABijl
May 16, 2024
Author

pjaaskel May 17, 2024
Maintainer