Introducing Timeline Semaphores API to Kompute #238

Open
axsaucedo opened this issue Aug 17, 2021 · 4 comments

@axsaucedo (Member) commented Aug 17, 2021

As suggested by @ChenKuo in #52, this would encompass adding support for the Vulkan Timeline Semaphores introduced in Vulkan 1.2 (https://www.khronos.org/blog/vulkan-timeline-semaphores). This would mean we would either have to drop support for pre-1.2 Vulkan, or make sure this feature sits behind a feature flag / compile-time macro and is tested against both Vulkan 1.1.x and 1.2.x.
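
For context on what such an abstraction would wrap: creating a timeline semaphore with the core Vulkan 1.2 API only requires chaining a VkSemaphoreTypeCreateInfo into the regular semaphore create info. Below is a minimal C++ sketch; the KP_ENABLE_TIMELINE_SEMAPHORES guard is a hypothetical name for the compile-time macro mentioned above, not an existing Kompute flag.

#include <vulkan/vulkan.h>
#include <cstdint>

// Hypothetical compile-time guard for the feature-flag approach discussed
// above; the macro name is illustrative only.
#ifdef KP_ENABLE_TIMELINE_SEMAPHORES
VkSemaphore createTimelineSemaphore(VkDevice device, uint64_t initialValue)
{
    // Core Vulkan 1.2 structure; on 1.1 the equivalent ...KHR variants from
    // VK_KHR_timeline_semaphore can be used instead.
    VkSemaphoreTypeCreateInfo timelineInfo{};
    timelineInfo.sType = VK_STRUCTURE_TYPE_SEMAPHORE_TYPE_CREATE_INFO;
    timelineInfo.semaphoreType = VK_SEMAPHORE_TYPE_TIMELINE;
    timelineInfo.initialValue = initialValue;

    VkSemaphoreCreateInfo createInfo{};
    createInfo.sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO;
    createInfo.pNext = &timelineInfo;

    VkSemaphore semaphore = VK_NULL_HANDLE;
    vkCreateSemaphore(device, &createInfo, nullptr, &semaphore);
    return semaphore;
}
#endif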

What we will first need to explore is an interface that provides a higher-level abstraction than the one proposed below, as well as an understanding of the corner-case behaviours that could arise from objects being removed, operations failing, etc.

The interface provided below is the currently proposed structure for the abstraction of the timeline API. However, this interface may not be feasible given that a sequence can only hold a single set of operations, and whenever a new one is recorded it would either append to or clear the previous ones.


The original proposal is below

Right now the only synchronization options (that I can see) are running eval() synchronously or using eval_await() asynchronously. Both cause the thread to stop, which translates into lost time when it could be sending the next batch to the queue. Vulkan 1.2 has the Timeline Semaphores API, which seems like a good solution if we can integrate it into the Kompute API.

For example, suppose I have algorithm A using tensors a, algorithm B using tensors b, and algorithm C using tensors a, b, c. A and B are independent, but C depends on the results of A and B. We only need the result from C, not the intermediate results from A and B. This is how I wish the code would look in Python (I am not sure if my understanding of Timeline Semaphores is correct; it is kind of confusing):

timeline_a = kp.TimelineSemaphore()
timeline_b = kp.TimelineSemaphore()
timeline_c = kp.TimelineSemaphore()

(sequence
  .record(kp.OpTensorSyncDevice(params_a))
  .eval_async(timeline_a(wait=0, signal=1))  # copy params_a to device asap
  .record(kp.OpAlgoDispatch(algo_a))
  .eval_async(timeline_a(wait=1, signal=2))  # run algo_a after params_a is copied to device
  .record(kp.OpTensorSyncDevice(params_b))
  .eval_async(timeline_b(wait=0, signal=1))  # copy params_b to device asap
  .record(kp.OpAlgoDispatch(algo_b))
  .eval_async(timeline_b(wait=1, signal=2))  # run algo_b after params_b is copied to device
  .record(kp.OpTensorSyncDevice(params_c))
  .eval_async(timeline_c(wait=0, signal=1))  # copy params_c to device asap
  .record(kp.OpAlgoDispatch(algo_c))
  .eval_async(
      timeline_a(wait=2, signal=4),
      timeline_b(wait=2, signal=4),
      timeline_c(wait=1, signal=2))  # run algo_c after algo_a and algo_b finish, and params_c is copied
  .record(kp.OpTensorSyncLocal(params_c))
  .eval_async(timeline_c(wait=2, signal=3))   # copy params_c back to host after algo_c is done
  .eval_await(timeline_c(wait=3, signal=4)))  # wait for params_c to be copied to host

# now we can use the result from C on the host
print([param.data() for param in params_c])
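
For reference, the final eval_await above corresponds to a host-side wait on a timeline value, which in the core Vulkan 1.2 API is a single call. A minimal C++ sketch of what any Kompute wrapper would invoke (device and timeline are assumed to have been created already):

#include <vulkan/vulkan.h>
#include <cstdint>

// Block the host until `timeline` reaches `value`. vkWaitSemaphores is core
// in Vulkan 1.2 (vkWaitSemaphoresKHR on 1.1 with VK_KHR_timeline_semaphore).
void waitForTimelineValue(VkDevice device, VkSemaphore timeline, uint64_t value)
{
    VkSemaphoreWaitInfo waitInfo{};
    waitInfo.sType = VK_STRUCTURE_TYPE_SEMAPHORE_WAIT_INFO;
    waitInfo.semaphoreCount = 1;
    waitInfo.pSemaphores = &timeline;
    waitInfo.pValues = &value;

    vkWaitSemaphores(device, &waitInfo, UINT64_MAX); // wait with no timeout
}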

There is a (partial) workaround: creating multiple threads and Sequence objects, so one thread/Sequence can move data around while another is waiting. However, I think this still does not solve the dependency issue. I am not an expert in Vulkan or C++, so what I wrote may be wrong, and maybe there is a better way I do not know of. If you know, please let me know.
Thanks.

@ChenKuo commented Aug 17, 2021

Note that the page I linked mentions an implementation of the timeline semaphore API as a Vulkan 1.1 layer, as part of the Vulkan-ExtensionLayer project. So it should be possible to make it work on Vulkan 1.1.
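
For the record, the Vulkan-ExtensionLayer project exposes this as the VK_LAYER_KHRONOS_timeline_semaphore layer. A minimal C++ sketch of enabling it at instance creation (real code should query layer availability first, and the VK_KHR_timeline_semaphore device extension still needs to be enabled on the device):

#include <vulkan/vulkan.h>

// Create an instance with the timeline semaphore emulation layer enabled,
// for use on Vulkan 1.1 drivers without native timeline semaphore support.
VkInstance createInstanceWithTimelineLayer()
{
    const char* layers[] = { "VK_LAYER_KHRONOS_timeline_semaphore" };

    VkInstanceCreateInfo instanceInfo{};
    instanceInfo.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    instanceInfo.enabledLayerCount = 1;
    instanceInfo.ppEnabledLayerNames = layers;

    VkInstance instance = VK_NULL_HANDLE;
    vkCreateInstance(&instanceInfo, nullptr, &instance);
    return instance;
}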

@axsaucedo (Member, Author) commented

Thanks @ChenKuo - also following up on the question you asked in #52:

@axsaucedo Thanks for your response. I see how I can use OpMemoryBarrier to implement dependencies. This way it can also submit everything in one batch, so it should be more efficient than coarse-grained synchronization using semaphores. The way I think semaphores would be useful is that we could synchronize across different queues, so we could use the result of the first batch to submit the second batch on the other queue before the first batch is finished.

Ok that sounds good. In that case, does OpMemoryBarrier solve your current use case? If so, do you have a relevant example where you would need to use the semaphores?

We would need a concrete example, as it seems that implementing the semaphore functionality for interdependencies would have to consider DAG-like dependencies between operations, which may require more thought to ensure it indeed works as expected, as opposed to being implemented as a workaround that just exposes the functionality.
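
For the single-queue, single-batch case referenced above, a dependency chain can be sketched with kp::OpMemoryBarrier recorded between dispatches. The C++ snippet below is illustrative only: mgr, algoA, algoC and tensorsA are assumed to exist, and the OpMemoryBarrier constructor arguments shown (the tensors plus Vulkan access/stage masks) should be checked against the current Kompute headers.

// Single submission; algoA writes tensors that algoC then reads, with an
// explicit memory barrier enforcing the ordering between the two dispatches.
std::shared_ptr<kp::Sequence> sq = mgr.sequence();
sq->record<kp::OpAlgoDispatch>(algoA)
  ->record<kp::OpMemoryBarrier>(
      tensorsA,                                   // tensors to guard
      vk::AccessFlagBits::eShaderWrite,           // source access
      vk::AccessFlagBits::eShaderRead,            // destination access
      vk::PipelineStageFlagBits::eComputeShader,  // source stage
      vk::PipelineStageFlagBits::eComputeShader)  // destination stage
  ->record<kp::OpAlgoDispatch>(algoC)
  ->eval(); // one queue submission; no host-side waiting in between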

@ChenKuo commented Aug 17, 2021

@axsaucedo Let's say I have a simple rendering pipeline implementation, so for N geometries we need to do rasterization -> color blending N times. If I can synchronize two queues, I can let queue1 focus on the rasterization algorithm (which only needs to update the vertex position index each pass) and queue2 focus on the blending algorithm (which needs to use the result from queue1 as it becomes available), and they work in parallel. If one queue falls behind, I can even use a third queue to balance the workload, which would require even more advanced synchronization. Timeline semaphores make this very easy, because I just need to match the values of rasterization_timeline to blending_timeline.
The reason I do not just use vertex/fragment shaders is that this is a differentiable rendering pipeline, and I need to be able to compute the gradient from pixels back to parameters. My implementation right now is in PyTorch, and it does not have this level of optimization. I am not in a hurry for this feature, especially since it is too early to make this kind of optimization at this stage, but it is a direction I am aiming toward.

@ChenKuo commented Aug 17, 2021

I am not sure if this code is syntactically correct, but this is the general idea for my scenario above.

# in the rasterization thread ...
for rasterization_pass_number in range(N):
    (sq1
     .record(
         kp.OpAlgoDispatch(rasterization_algo,
         [rasterization_pass_number]))  # in the shader this index points to the vertex positions
     .eval_async(  # do not run more than 10 passes ahead of blending_algo; we have limited memory to store the results
         rasterization_timeline(signal=rasterization_pass_number + 1),  # no wait because rasterization is independent
         blending_timeline(wait=max(0, rasterization_pass_number - 10))))  # timeline values are unsigned, so clamp at 0

# in the blending thread ...
for blending_pass_number in range(N):
    (sq2
     .record(
         kp.OpAlgoDispatch(blending_algo,
         [blending_pass_number]))  # in the shader we can use this index to find the rasterization result
     .eval_async(  # rasterization_algo needs to be at least 1 pass ahead of blending_algo
         rasterization_timeline(wait=blending_pass_number + 1),
         blending_timeline(wait=blending_pass_number, signal=blending_pass_number + 1)))
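
For reference, each eval_async above would map to a vkQueueSubmit with a VkTimelineSemaphoreSubmitInfo chained in. A minimal C++ sketch of one blending-pass submission at the raw Vulkan 1.2 level (cmdBuf, the queue and the two timeline semaphores are assumed to exist; only the cross-queue wait is shown, and the second wait on blendingTimeline would be added the same way by growing the arrays):

#include <vulkan/vulkan.h>
#include <cstdint>

// Submit one blending pass: wait until rasterizationTimeline reaches
// blendingPass + 1, then signal blendingTimeline with blendingPass + 1.
void submitBlendingPass(VkQueue blendingQueue, VkCommandBuffer cmdBuf,
                        VkSemaphore rasterizationTimeline,
                        VkSemaphore blendingTimeline, uint64_t blendingPass)
{
    uint64_t waitValue = blendingPass + 1;
    uint64_t signalValue = blendingPass + 1;

    VkTimelineSemaphoreSubmitInfo timelineInfo{};
    timelineInfo.sType = VK_STRUCTURE_TYPE_TIMELINE_SEMAPHORE_SUBMIT_INFO;
    timelineInfo.waitSemaphoreValueCount = 1;
    timelineInfo.pWaitSemaphoreValues = &waitValue;
    timelineInfo.signalSemaphoreValueCount = 1;
    timelineInfo.pSignalSemaphoreValues = &signalValue;

    VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;

    VkSubmitInfo submitInfo{};
    submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submitInfo.pNext = &timelineInfo;
    submitInfo.waitSemaphoreCount = 1;
    submitInfo.pWaitSemaphores = &rasterizationTimeline;
    submitInfo.pWaitDstStageMask = &waitStage;
    submitInfo.signalSemaphoreCount = 1;
    submitInfo.pSignalSemaphores = &blendingTimeline;
    submitInfo.commandBufferCount = 1;
    submitInfo.pCommandBuffers = &cmdBuf;

    vkQueueSubmit(blendingQueue, 1, &submitInfo, VK_NULL_HANDLE);
}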

axsaucedo added this to To do in 1.0.0 via automation on May 2, 2022