
pocl_util: fix numerous deadlocks on disconnect #1459

Draft: wants to merge 3 commits into base: main
Conversation

RABijl (Contributor) commented May 8, 2024

Non-exhaustive list:

  • fix an event being marked complete after it was already marked failed
  • fix a deadlock between an event and a memobj occurring between pocl_update_event_finished and pocl_create_migration_commands
  • fix a deadlock where an event tries to remove itself from the wait_list of an event that is notifying it of its failure
  • catch the error when an event is created on a command queue that is not available

I'll add some comments to explain parts of the commit.

@@ -2478,42 +2513,6 @@ pocl_update_event_finished (cl_int status, const char *func, unsigned line,
POCL_UNLOCK_OBJ (event);
ops->broadcast (event);

/* With remote being asynchronous it is possible that an event is signaled as
RABijl (Contributor, Author) commented May 8, 2024:
If an event is marked as failed, it will lock each event in its notify list (when broadcasting), notify it, and mark it as failed as well. Event chains shaped like the attached diagram will deadlock.
Untitled Diagram.pdf

Member:

Can you convert this to a code comment somewhere reasonable? It will get lost if it's only a PR comment.

RABijl (Contributor, Author):

I've put this here as a PR comment since it is about code that is being removed. I could add a comment above the broadcast function warning about its locking behavior.

Member:

I think adding more code documentation would be beneficial, yes.

RABijl (Contributor, Author):

I've left a warning in the pocl_cl.h description of the broadcast function.

* @warning Eventually clReleaseEvent is called, so this function can end up
* freeing the event.
* @param event to mark as device lost
*/
void
pocl_update_event_device_lost (cl_event event)
RABijl (Contributor, Author):

The remote driver is the only one that calls this function.

pjaaskel (Member) left a comment:

Thanks. Deadlock fixes are very valuable. Just some minor stylistic things. The CIs passed, so this will go in after those are fixed.

Resolved (outdated) review threads in: lib/CL/devices/basic/basic.c, lib/CL/devices/proxy/pocl_proxy.c, lib/CL/devices/pthread/pthread.c, lib/CL/devices/remote/remote.c (two threads), lib/CL/pocl_util.c (two threads).

Two further resolved (outdated) review threads in lib/CL/pocl_util.c.
pjaaskel (Member):

With this branch I see a deadlock quite consistently, roughly every second run, when I use two CPU remotes locally with the jacobian multi-device SYCL test case:

pjaaskel@pjaaskel-mobl1:~/src/pocl/build-remote-llvm-17$ POCL_TRACING=cq POCL_KERNEL_CACHE=1 ONEAPI_DEVICE_SELECTOR="opencl:cpu" POCL_DEBUG=0 POCL_DEVICES="remote remote" POCL_REMOTE0_PARAMETERS="127.0.0.1:13000" POCL_REMOTE1_PARAMETERS="127.0.0.1:13003" LD_PRELOAD=/opt/intel/oneapi/compiler/2024.1/lib/libsycl.so  examples/oneapi-samples/src/oneapi-samples-build/jacobian-solver/jacobian_solver_multidevice_sycl 
Running on 2 device(s)
	Using device: cpu-tigerlake-11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
	Using device: cpu-tigerlake-11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz



^C
pjaaskel@pjaaskel-mobl1:~/src/pocl/build-remote-llvm-17$ POCL_TRACING=cq POCL_KERNEL_CACHE=1 ONEAPI_DEVICE_SELECTOR="opencl:cpu" POCL_DEBUG=0 POCL_DEVICES="remote remote" POCL_REMOTE0_PARAMETERS="127.0.0.1:13000" POCL_REMOTE1_PARAMETERS="127.0.0.1:13003" LD_PRELOAD=/opt/intel/oneapi/compiler/2024.1/lib/libsycl.so  examples/oneapi-samples/src/oneapi-samples-build/jacobian-solver/jacobian_solver_multidevice_sycl 
Running on 2 device(s)
	Using device: cpu-tigerlake-11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
	Using device: cpu-tigerlake-11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
OK

     kernel                           launches        total us          avg us
  1) _ZTSZZZ4mainENKUlR8subarrayE_clES0_ENKUlRT_E0_clIN4sycl3_V17handlerEEEDaS3_EUlS2_E_        200       131074132  87%     655370
  2) _ZTSZZN4sycl3_V16detail16NDRangeReductionILNS1_9reduction8strategyE1EE3runINS1_9auto_nameELi1ENS0_3ext6oneapi12experimental10propertiesISt5tupleIJEEEEZNS1_22reduction_parallel_forIS7_LS4_0ELi2ESE_JNS1_14reduction_implIdSt4plusIvELi0ELm1ELb0ENS0_6bufferIdLi1ENS1_17aligned_allocatorIdEEvEEEEZZZ4mainENKUlR8subarrayE_clESP_ENKUlRT_E1_clINS0_7handlerEEEDaSS_EUlSR_RT0_E_EEEvRSV_NS0_5rangeIXT1_EEET2_DpT3_EUlSR_DpRT0_E_SN_EEvSZ_RSt10shared_ptrINS1_10queue_implEENS0_8nd_rangeIXT0_EEERT1_RT3_RS12_ENKUlSR_E_clINS0_8accessorIiLi1ELNS0_6access4modeE1026ELNS1N_6targetE2014ELNS1N_11placeholderE0ENS9_22accessor_property_listIJEEEEEEEDaSR_EUlNS0_7nd_itemILi1EEEE_         20        14498641   9%     724932
  3) _ZTSZZZ4mainENKUlR8subarrayE_clES0_ENKUlRT_E_clIN4sycl3_V17handlerEEEDaS3_EUlS2_E_        200         3496455   2%      17482
                                    ==========      ========== ==== ==========
                                           420       149069228 100%     354926
pjaaskel@pjaaskel-mobl1:~/src/pocl/build-remote-llvm-17$ POCL_TRACING=cq POCL_KERNEL_CACHE=1 ONEAPI_DEVICE_SELECTOR="opencl:cpu" POCL_DEBUG=0 POCL_DEVICES="remote remote" POCL_REMOTE0_PARAMETERS="127.0.0.1:13000" POCL_REMOTE1_PARAMETERS="127.0.0.1:13003" LD_PRELOAD=/opt/intel/oneapi/compiler/2024.1/lib/libsycl.so  examples/oneapi-samples/src/oneapi-samples-build/jacobian-solver/jacobian_solver_multidevice_sycl 
Running on 2 device(s)
	Using device: cpu-tigerlake-11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
	Using device: cpu-tigerlake-11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz

^C

If I enable logging on pocld it doesn't get stuck as often, so it is perhaps some sort of race.

Roughly 10 executions in a row didn't get stuck with the main branch.

pjaaskel (Member) commented May 10, 2024

OK, after some 8 runs I managed to make it deadlock with logging enabled in pocld1:

[2024-05-10 10:05:48.655582975] PoCL-D: in fn virtual int SharedCLContext::writeBuffer(uint64_t, uint32_t, uint64_t, int, size_t, size_t, void*, EventTiming_t&, uint32_t, uint64_t*) at line 2354:
 |    EVENTS |  writeBuffer event 271
[2024-05-10 10:05:48.655599219] PoCL-D: in fn writeThread at line 223:
 |   GENERAL |  WT_F: SENDING MESSAGE, ID: 2354 TYPE: MigrateD2DReply SIZE: 144 EXTRA: 0 FAILED: 0
[2024-05-10 10:05:48.655631079] PoCL-D: in fn writeThread at line 252:
 |   GENERAL |  WT_F: MESSAGE FULLY WRITTEN, ID: 2354
[2024-05-10 10:05:48.655635972] PoCL-D: in fn notifyEvent at line 343:
 |    EVENTS |  Updating event 271 status to 0
[2024-05-10 10:05:48.655643291] PoCL-D: in fn virtual void SharedCLContext::notifyEvent(uint64_t, cl_int) at line 827:
 |    EVENTS |  271: only native event exists, doing nothing
[2024-05-10 10:05:48.655657558] PoCL-D: in fn writerThread at line 95:
 |   GENERAL |  PHW: SENDING MESSAGE, ID: 2354 TYPE: NotifyEvent SIZE: 72, TO 44

I'm not sure if this is specific to your branch or also present in main, as I've seen these occasionally when testing. Main feels generally more stable; I'm not able to make it deadlock easily, but I occasionally see one crash:

pocld: /home/pjaaskel/src/pocl/pocld/cmd_queue.cc:177: void CommandQueue::RunCommand(Request*): Assertion `p.native.get()' failed.

Which could be due to a race.

RABijl (Contributor, Author) commented May 10, 2024

If main is more stable, I'm fine with waiting to merge this. I'm pretty sure I didn't catch all the potential deadlock scenarios, just the ones I experienced with my combination of hardware and workload. I've opened an issue (#1461) discussing more general locking topics; maybe from there we can see how to move forward.

include/pocl.h (outdated thread):
@@ -516,6 +516,11 @@ struct _cl_command_node
cl_device_id device;
/* The index of the targeted device in the **program** device list. */
unsigned program_device_i;
/*
* -1: command_failed
Member:

Best add macros for these to avoid magic numbers.

Member:

Well, an enum is even better, since we allow them in C code.

RABijl (Contributor, Author):

I've changed it to an enum.

@@ -483,10 +483,16 @@ typedef union
_cl_command_svm_memadvise mem_advise;
} _cl_command_t;

typedef enum
{
COMMAND_FAILED = -1,
Member:

Suggested change
COMMAND_FAILED = -1,
POCL_COMMAND_FAILED = -1,

And if you don't need to force the enum value, better to just leave it out. 0 might have significance, as it's the default if zero-initialized.

* 1: ready
*/
cl_int ready;
command_node_state node_state;
Member:

Suggested change
command_node_state node_state;
pocl_command_node_state node_state;

Due to lack of namespaces, best add name prefixes.

pjaaskel marked this pull request as draft May 10, 2024 13:14
@@ -599,7 +599,7 @@ pocl_create_event_sync (cl_event waiting_event, cl_event notifier_event)
}
}

if (notifier_event->status == CL_COMPLETE)
if (notifier_event->status <= CL_COMPLETE)
Member:

This seems creepy. Can you add a comment here?

RABijl (Contributor, Author):

It is to prevent an event sync from being created with an event that has already failed. Like completed events, failed events only notify waiting events at the moment the event gets marked as failed.

PS: this branch is still a WIP and some commits might not make it into the final branch.

Member:

OK, the creepiness I meant is due to the all-encompassing <= operator. Perhaps compare directly to CL_FAILED or whatever the status for that is.

RABijl (Contributor, Author) commented May 17, 2024:

Ah, I see. The thing is that an event can have pretty much any error code; CL_COMPLETE is 0, so with <= we catch all of them. https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_API.html#clGetEventInfo

Member:

Ah, I see. Then I'd even use a magic number here (< 0) and add a comment, since the spec says: "Or an error code given by a negative integer value (command was abnormally terminated - this may be caused by a bad memory access etc.). These error codes come from the same set of error codes that are returned from the platform or runtime API calls as return values or errcode_ret values."

RABijl (Contributor, Author):

I'd say the <= CL_COMPLETE expression is understood to mean CL_COMPLETE or an error; it is used in more places. I think the current comment explains it sufficiently:

    // If the notifier event is already complete (or failed),
    // don't create an event sync. This is fine since if the wait
    // event has no notifier events and gets submitted, it can start
    // right away.

Member:
The reason will be displayed to describe this comment to others. Learn more.

Yeah, but the spec says "negative", not "smaller than or equal to CL_COMPLETE, whatever that is" :) Not a big deal, just caught my eye.

RABijl force-pushed the deadlock-disconnect branch 2 times, most recently from 4ea6240 to 7156f9e on May 17, 2024 06:00