
pocl_util: fix numerous deadlocks on disconnect #1459

Draft: wants to merge 3 commits into base: main
Conversation

RABijl (Contributor) commented May 8, 2024

Non-exhaustive list:

  • fix an event being marked complete after it was already marked failed
  • fix a deadlock between an event and a memobj occurring between pocl_update_event_finished and pocl_create_migration_commands
  • fix a deadlock where an event tries to remove itself from the wait_list of an event that is notifying it of its failure
  • catch the error when an event is created on a command queue that is not available

I'll add some comments to explain parts of the commit.

@@ -2478,42 +2513,6 @@ pocl_update_event_finished (cl_int status, const char *func, unsigned line,
POCL_UNLOCK_OBJ (event);
ops->broadcast (event);

/* With remote being asynchronous it is possible that an event is signaled as
RABijl (Contributor, Author) commented May 8, 2024:
If an event is marked as failed, it will lock each event in its notify list (when broadcasting), notify it, and mark it as failed as well. Event chains shaped like the attached diagram will deadlock.
Untitled Diagram.pdf

Member:

Can you convert this to a code comment somewhere reasonable? It will get lost if it's only a PR comment.

RABijl (Contributor, Author):

I've put this here as a PR comment since it is about code that is being removed. I could add a comment above the broadcast function warning about its locking behavior.

Member:

I think adding more code documentation would be beneficial, yes.

RABijl (Contributor, Author):

I've left a warning in the pocl_cl.h description of the broadcast function.

* @warning Eventually clReleaseEvent is called, so this function can end up
* freeing the event.
* @param event to mark as device lost
*/
void
pocl_update_event_device_lost (cl_event event)
RABijl (Contributor, Author):

The remote driver is the only one that calls this function.

pjaaskel (Member) left a comment:

Thanks. Deadlock fixes are very valuable. Just some minor stylistic things. The CIs passed, so this will go in after those are fixed.

Resolved (outdated) review threads in: lib/CL/devices/basic/basic.c, lib/CL/devices/proxy/pocl_proxy.c, lib/CL/devices/pthread/pthread.c, lib/CL/devices/remote/remote.c (two threads), lib/CL/pocl_util.c (two threads).

Two further resolved (outdated) review threads in lib/CL/pocl_util.c.
pjaaskel (Member):

With this branch I see a deadlock quite consistently, roughly every second run, when I use two CPU remotes locally with the jacobian multi-device SYCL test case:

pjaaskel@pjaaskel-mobl1:~/src/pocl/build-remote-llvm-17$ POCL_TRACING=cq POCL_KERNEL_CACHE=1 ONEAPI_DEVICE_SELECTOR="opencl:cpu" POCL_DEBUG=0 POCL_DEVICES="remote remote" POCL_REMOTE0_PARAMETERS="127.0.0.1:13000" POCL_REMOTE1_PARAMETERS="127.0.0.1:13003" LD_PRELOAD=/opt/intel/oneapi/compiler/2024.1/lib/libsycl.so  examples/oneapi-samples/src/oneapi-samples-build/jacobian-solver/jacobian_solver_multidevice_sycl 
Running on 2 device(s)
	Using device: cpu-tigerlake-11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
	Using device: cpu-tigerlake-11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz



^C
pjaaskel@pjaaskel-mobl1:~/src/pocl/build-remote-llvm-17$ POCL_TRACING=cq POCL_KERNEL_CACHE=1 ONEAPI_DEVICE_SELECTOR="opencl:cpu" POCL_DEBUG=0 POCL_DEVICES="remote remote" POCL_REMOTE0_PARAMETERS="127.0.0.1:13000" POCL_REMOTE1_PARAMETERS="127.0.0.1:13003" LD_PRELOAD=/opt/intel/oneapi/compiler/2024.1/lib/libsycl.so  examples/oneapi-samples/src/oneapi-samples-build/jacobian-solver/jacobian_solver_multidevice_sycl 
Running on 2 device(s)
	Using device: cpu-tigerlake-11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
	Using device: cpu-tigerlake-11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
OK

     kernel                           launches        total us          avg us
  1) _ZTSZZZ4mainENKUlR8subarrayE_clES0_ENKUlRT_E0_clIN4sycl3_V17handlerEEEDaS3_EUlS2_E_        200       131074132  87%     655370
  2) _ZTSZZN4sycl3_V16detail16NDRangeReductionILNS1_9reduction8strategyE1EE3runINS1_9auto_nameELi1ENS0_3ext6oneapi12experimental10propertiesISt5tupleIJEEEEZNS1_22reduction_parallel_forIS7_LS4_0ELi2ESE_JNS1_14reduction_implIdSt4plusIvELi0ELm1ELb0ENS0_6bufferIdLi1ENS1_17aligned_allocatorIdEEvEEEEZZZ4mainENKUlR8subarrayE_clESP_ENKUlRT_E1_clINS0_7handlerEEEDaSS_EUlSR_RT0_E_EEEvRSV_NS0_5rangeIXT1_EEET2_DpT3_EUlSR_DpRT0_E_SN_EEvSZ_RSt10shared_ptrINS1_10queue_implEENS0_8nd_rangeIXT0_EEERT1_RT3_RS12_ENKUlSR_E_clINS0_8accessorIiLi1ELNS0_6access4modeE1026ELNS1N_6targetE2014ELNS1N_11placeholderE0ENS9_22accessor_property_listIJEEEEEEEDaSR_EUlNS0_7nd_itemILi1EEEE_         20        14498641   9%     724932
  3) _ZTSZZZ4mainENKUlR8subarrayE_clES0_ENKUlRT_E_clIN4sycl3_V17handlerEEEDaS3_EUlS2_E_        200         3496455   2%      17482
                                    ==========      ========== ==== ==========
                                           420       149069228 100%     354926
pjaaskel@pjaaskel-mobl1:~/src/pocl/build-remote-llvm-17$ POCL_TRACING=cq POCL_KERNEL_CACHE=1 ONEAPI_DEVICE_SELECTOR="opencl:cpu" POCL_DEBUG=0 POCL_DEVICES="remote remote" POCL_REMOTE0_PARAMETERS="127.0.0.1:13000" POCL_REMOTE1_PARAMETERS="127.0.0.1:13003" LD_PRELOAD=/opt/intel/oneapi/compiler/2024.1/lib/libsycl.so  examples/oneapi-samples/src/oneapi-samples-build/jacobian-solver/jacobian_solver_multidevice_sycl 
Running on 2 device(s)
	Using device: cpu-tigerlake-11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
	Using device: cpu-tigerlake-11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz

^C

If I enable logging on pocld it doesn't get stuck as often, so it is perhaps some sort of race.

Roughly 10 executions in a row didn't get stuck with the main branch.

pjaaskel (Member) commented May 10, 2024

OK, after some 8 runs I managed to make it deadlock with logging enabled in pocld1:

[2024-05-10 10:05:48.655582975] PoCL-D: in fn virtual int SharedCLContext::writeBuffer(uint64_t, uint32_t, uint64_t, int, size_t, size_t, void*, EventTiming_t&, uint32_t, uint64_t*) at line 2354:
 |    EVENTS |  writeBuffer event 271
[2024-05-10 10:05:48.655599219] PoCL-D: in fn writeThread at line 223:
 |   GENERAL |  WT_F: SENDING MESSAGE, ID: 2354 TYPE: MigrateD2DReply SIZE: 144 EXTRA: 0 FAILED: 0
[2024-05-10 10:05:48.655631079] PoCL-D: in fn writeThread at line 252:
 |   GENERAL |  WT_F: MESSAGE FULLY WRITTEN, ID: 2354
[2024-05-10 10:05:48.655635972] PoCL-D: in fn notifyEvent at line 343:
 |    EVENTS |  Updating event 271 status to 0
[2024-05-10 10:05:48.655643291] PoCL-D: in fn virtual void SharedCLContext::notifyEvent(uint64_t, cl_int) at line 827:
 |    EVENTS |  271: only native event exists, doing nothing
[2024-05-10 10:05:48.655657558] PoCL-D: in fn writerThread at line 95:
 |   GENERAL |  PHW: SENDING MESSAGE, ID: 2354 TYPE: NotifyEvent SIZE: 72, TO 44

I'm not sure if this is specific to your branch or also present in main, as I've seen these occasionally when testing. Main feels generally more stable; I'm not able to make it deadlock easily, but I occasionally see one crash:

pocld: /home/pjaaskel/src/pocl/pocld/cmd_queue.cc:177: void CommandQueue::RunCommand(Request*): Assertion `p.native.get()' failed.

Which could be due to a race.

RABijl (Contributor, Author) commented May 10, 2024

If main is more stable, I'm fine with waiting to merge this. I'm pretty sure I didn't catch all the potential deadlock scenarios, just the ones I experienced with my combination of hardware and workload. I've opened an issue (#1461) discussing more general locking topics; maybe from there we can see how to move forward.

include/pocl.h (outdated thread):
@@ -516,6 +516,11 @@ struct _cl_command_node
cl_device_id device;
/* The index of the targeted device in the **program** device list. */
unsigned program_device_i;
/*
* -1: command_failed
Member:

Best add macros for these to avoid magic numbers.

Member:

Well, an enum is even better, since we allow them in C code.

RABijl (Contributor, Author):

I've changed it to an enum.

@@ -483,10 +483,16 @@ typedef union
_cl_command_svm_memadvise mem_advise;
} _cl_command_t;

typedef enum
{
COMMAND_FAILED = -1,
Member:

Suggested change
COMMAND_FAILED = -1,
POCL_COMMAND_FAILED = -1,

And if you don't need to force the enum value, better to just leave it out. 0 might have significance, as it's the default if zero-initialized.

* 1: ready
*/
cl_int ready;
command_node_state node_state;
Member:

Suggested change
command_node_state node_state;
pocl_command_node_state node_state;

Due to lack of namespaces, best add name prefixes.

pjaaskel marked this pull request as draft May 10, 2024 13:14
@@ -599,7 +599,7 @@ pocl_create_event_sync (cl_event waiting_event, cl_event notifier_event)
}
}

if (notifier_event->status == CL_COMPLETE)
if (notifier_event->status <= CL_COMPLETE)
Member:

This seems creepy. Can you add a comment here?

RABijl (Contributor, Author):

It is to prevent an event sync from being created with an event that has already failed. Like completed events, failed events only notify waiting events at the moment the event gets marked as failed.

PS: this branch is still a WIP and some commits might not make it into the final branch.

Member:

OK, the creepiness I meant is due to the all-encompassing <= operator. Perhaps compare directly to CL_FAILED or whatever the status for that is.

RABijl (Contributor, Author) commented May 17, 2024:

Ah, I see. The thing is that an event can have pretty much any error code; CL_COMPLETE is 0, so with <= we catch all of them. https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_API.html#clGetEventInfo

Member:

Ah, I see. Then I'd even use a magic number here (< 0) and add a comment, since the spec says: "Or an error code given by a negative integer value (command was abnormally terminated - this may be caused by a bad memory access etc.). These error codes come from the same set of error codes that are returned from the platform or runtime API calls as return values or errcode_ret values."

RABijl (Contributor, Author):

I'd say the <= CL_COMPLETE expression is understood to mean CL_COMPLETE or an error; it is used in more places. I think the current comment explains it sufficiently:

    // If the notifier event is already complete (or failed),
    // don't create an event sync. This is fine since if the wait
    // event has no notifier events and gets submitted, it can start
    // right away.

Member:
The reason will be displayed to describe this comment to others. Learn more.

Yeah, but the spec says "negative", not "smaller than or equal to CL_COMPLETE, whatever that is" :) Not a big deal, just caught my eye.

RABijl force-pushed the deadlock-disconnect branch 2 times, most recently from 4ea6240 to 7156f9e on May 17, 2024 06:00