-
Notifications
You must be signed in to change notification settings - Fork 243
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pocl_util: fix numberous deadlocks on disconnect #1459
base: main
Are you sure you want to change the base?
Conversation
@@ -2478,42 +2513,6 @@ pocl_update_event_finished (cl_int status, const char *func, unsigned line, | |||
POCL_UNLOCK_OBJ (event); | |||
ops->broadcast (event); | |||
|
|||
/* With remote being asynchronous it is possible that an event is signaled as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if an event is marked as failed, it will lock (when broadcasting), notify and mark all events in it's notify list as failed as well. event chains that look something like the image will be deadlocked
Untitled Diagram.pdf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you convert this to a code comment somewhere reasonable? It will get lost if it's only a PR comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i've put this here as a comment since it is about code that is being removed. I could add a comment above the broadcast function warning about the locking behavior it has.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think adding more code documentation would be beneficial, yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i've left a warning in the pocl_cl.h description of the broadcast function
* @warning Eventually clReleaseEvent is called, so this function can end up | ||
* freeing the event. | ||
* @param event to mark as device lost | ||
*/ | ||
void | ||
pocl_update_event_device_lost (cl_event event) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the remote driver is the only one to call this function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Deadlock fixes are very valuable. Just some minor stylistic things. The CIs passed so this will go in after those are fixed.
@@ -2478,42 +2513,6 @@ pocl_update_event_finished (cl_int status, const char *func, unsigned line, | |||
POCL_UNLOCK_OBJ (event); | |||
ops->broadcast (event); | |||
|
|||
/* With remote being asynchronous it is possible that an event is signaled as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you convert this to a code comment somewhere reasonable? It will get lost if it's only a PR comment.
non exhaustive list: * fix event being marked complete after being marked failed * fix deadlock between event and memobj happening in pocl_update_event_finished and pocl_create_migration_commands * fix deadlock of event trying to remove it self from wait_list of event trying notifying that it has failed. * catch error when a event is created on a command queue that is not available
79e61eb
to
ec6e970
Compare
With this branch I see a deadlock quite consistently roughly every 2nd time when I use two CPU remotes locally with the jacobian multi-device SYCL test case:
If I enable logging on pocld it doesn't get stuck so often, so some sort of race perhaps. Roughly 10 executions in a row didn't get stuck with the main branch. |
OK, after some 8 runs I managed to make it deadlock with log on in pocld1:
I'm not sure if this is your branch or also in main as I've seen these occasionally when testing. Main feels generally more stable, I'm not able to make it deadlock easily, but I see one crash occasionally:
Which could be due to a race. |
If the main is more stable, i'm fine with waiting to merge this. I'm pretty sure I didn't get all the potential deadlocks scenarios, just the ones I experienced with my combination of hardware and workload. I've made an issue (#1461) discussing more general locking topics. Maybe from there we can see how to move further. |
include/pocl.h
Outdated
@@ -516,6 +516,11 @@ struct _cl_command_node | |||
cl_device_id device; | |||
/* The index of the targeted device in the **program** device list. */ | |||
unsigned program_device_i; | |||
/* | |||
* -1: command_failed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Best add macros for these to avoid magic numbers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, enum even better as we allow them in C code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i've changed it to an enum
@@ -483,10 +483,16 @@ typedef union | |||
_cl_command_svm_memadvise mem_advise; | |||
} _cl_command_t; | |||
|
|||
typedef enum | |||
{ | |||
COMMAND_FAILED = -1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
COMMAND_FAILED = -1, | |
POCL_COMMAND_FAILED = -1, |
And if you don't need to force the enum number, better just leave it out. 0 might have significance as it's the default if zero initialized.
* 1: ready | ||
*/ | ||
cl_int ready; | ||
command_node_state node_state; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
command_node_state node_state; | |
pocl_command_node_state node_state; |
Due to lack of namespaces, best add name prefixes.
lib/CL/pocl_util.c
Outdated
@@ -599,7 +599,7 @@ pocl_create_event_sync (cl_event waiting_event, cl_event notifier_event) | |||
} | |||
} | |||
|
|||
if (notifier_event->status == CL_COMPLETE) | |||
if (notifier_event->status <= CL_COMPLETE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems creepy. Can you add a comment here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is to prevent an event sync to be created with an event that has already failed. like completed events, failed events only notify waiting events when the event gets marked as failed.
ps. this branch is still a wip and some commits might not make it in the final branch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, the creepiness I meant is due to the all encompassing <=
operator. Perhaps compare directly to CL_FAILED or whatever the status is for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah i see, the thing is that an event can have pretty much any error code, cl_complete is 0, so now we catch all types. https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_API.html#clGetEventInfo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. Then I'd even use a magic number here (< 0) and add a comment since the spec says: Or an error code given by a negative integer value (command was abnormally terminated - this may be caused by a bad memory access etc.). These error codes come from the same set of error codes that are returned from the platform or runtime API calls as return values or errcode_ret values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say the <= CL_COMPLETE
statement is understood to mean cl_complete or an error, it is done in more places. I think that the current comment is sufficient in explaining it:
// If the notifier event is already complete (or failed),
// don't create an event sync. This is fine since if the wait
// event has no notifier events and gets submitted, it can start
// right away.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, but the spec says "negative" not "smaller or equal to CL_COMPLETE, whatever that is" :) Not a big deal, just caught my eye.
4ea6240
to
7156f9e
Compare
7156f9e
to
d3eef2e
Compare
non exhaustive list:
I'll add some comments to explain parts of the commit