[hal/vk] Rework Submission and Surface Synchronization #5681

cwfitzgerald · 2024-05-08T22:55:43Z

Connections

Description

As described in #5559, our submission and surface synchronization was totally messed up. I've revamped it, made smarter classes to help keep all the bookkeeping straight, and generally made the code cleaner and easier to understand.

Testing

works on my machine lel

JMS55 · 2024-05-09T06:04:38Z

Works fine for me running the cube example with VK validation layers enabled

[2024-05-09T06:03:33Z INFO  wgpu_core::instance] Adapter Vulkan AdapterInfo { name: "Intel(R) Arc(TM) Graphics", vendor: 32902, device: 32085, device_type: IntegratedGpu, driver: "Intel Corporation", driver_info: "101.5382", backend: Vulkan }

Vecvec · 2024-05-09T07:31:01Z

This fixes an annoying access violation termination (that I thought was a driver bug) on program exit after a present.

hecrj · 2024-05-09T08:25:51Z

Unfortunately, it does not fix #5644 for me. Same constant validation error on presentation:

[2024-05-09T08:16:24Z ERROR wgpu_hal::vulkan::instance] VALIDATION [VUID-vkAcquireNextImageKHR-semaphore-01779 (0x5717e75b)]
        Validation Error: [ VUID-vkAcquireNextImageKHR-semaphore-01779 ] Object 0: handle = 0xba7514000000002a, type = VK_OBJECT_TYPE_SEMAPHORE; | MessageID = 0x5717e75b | vkAcquireNextImageKHR():  Semaphore must not have any pending operations. The Vulkan spec states: If semaphore is not VK_NULL_HANDLE it must not have any uncompleted signal or wait operations pending (https://www.khronos.org/registry/vulkan/specs/1.3-extensions/html/vkspec.html#VUID-vkAcquireNextImageKHR-semaphore-01779)
[2024-05-09T08:16:24Z ERROR wgpu_hal::vulkan::instance]         objects: (type: SEMAPHORE, hndl: 0xba7514000000002a, name: ?)

The actual validation error only happens with a debug build, although release feels much slower than 0.19 as well. I believe something is very slow and the errors actually only show up in debug mode.

A flamegraph of the cube example seems to indicate this, with present taking a huge amount of time:

For comparison, same drivers with 0.19:

Even if I enable Immediate present mode, frames seem to take twice as much time to be rendered with 0.20.

cwfitzgerald · 2024-05-10T23:26:36Z

@hecrj could you try again with these latest commits? Someone else who can reproduce the validation error says that they fixed it.

hecrj · 2024-05-11T10:37:33Z

@cwfitzgerald Yep! The latest changes fix the validation errors. Thanks!

I'm still noticing a bit of worse performance in debug mode, but it doesn't seem to be there in release mode anymore. Flamegraphs look the same as before, but I imagine it's expected since now synchronization works properly and there may be more waiting. No errors in any case!

wgpu-hal/src/lib.rs

wgpu-hal/src/vulkan/mod.rs

jimblandy · 2024-05-14T01:26:06Z

I'm actually rewriting these comments as I review so if you're not attached to them then don't worry too much about fixing anything.

cwfitzgerald · 2024-05-14T02:05:11Z

Yeah that's fine - I was just trying to get something there.

wgpu-hal/src/lib.rs

wgpu-hal/src/vulkan/mod.rs

jimblandy · 2024-05-17T20:21:25Z

@cwfitzgerald What do you think of this? da543fb51

My hope was that this would make the states of `RelaySemaphores` a bit clearer to a new reader. If people should not use `wait` unless `should_wait` is true, then let's actually force people to check that condition before they can even get at the value: classic `Option`. But it does make `advance` fallible and require us to pass the device.

[edit: made this a little nicer still: RelaySemaphoreState is no longer necessary]
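
To make the idea concrete, here is a minimal sketch of the `Option`-based relay. It is not the code from the commit: `Semaphore` and `MockDevice` are hypothetical stand-ins for `vk::Semaphore` and the real device so the example compiles on its own, and the `String` error type is likewise an assumption.

```rust
/// Stand-in for `vk::Semaphore` (assumption: the real type is an opaque Vulkan handle).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Semaphore(u64);

/// Hypothetical device that hands out fresh semaphore handles.
struct MockDevice {
    next_id: u64,
}

impl MockDevice {
    /// Mimics a fallible "create binary semaphore" call.
    fn new_binary_semaphore(&mut self) -> Result<Semaphore, String> {
        self.next_id += 1;
        Ok(Semaphore(self.next_id))
    }
}

/// The semaphores one submission should wait on and signal.
#[derive(Clone, Debug, PartialEq, Eq)]
struct RelaySemaphores {
    /// `None` for the very first submission: there is nothing to wait for yet.
    wait: Option<Semaphore>,
    /// Signaled when this submission completes.
    signal: Semaphore,
}

impl RelaySemaphores {
    fn new(device: &mut MockDevice) -> Result<Self, String> {
        Ok(Self {
            wait: None,
            signal: device.new_binary_semaphore()?,
        })
    }

    /// Returns the state the current submission should use, then prepares
    /// the state for the next one.
    fn advance(&mut self, device: &mut MockDevice) -> Result<Self, String> {
        let old = self.clone();
        match self.wait {
            None => {
                // The next submission waits on what this one signals, and
                // needs a second semaphore to signal in turn. Creating it
                // is what makes `advance` fallible.
                self.wait = Some(old.signal);
                self.signal = device.new_binary_semaphore()?;
            }
            Some(ref mut wait) => {
                // From here on the two semaphores just trade places.
                std::mem::swap(wait, &mut self.signal);
            }
        }
        Ok(old)
    }
}
```

With this shape, a caller cannot read `wait` without handling the `None` case: the first `advance` yields no wait semaphore, and later calls alternate between the two handles.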

jimblandy · 2024-05-17T21:14:24Z

@cwfitzgerald Here's an even more radical suggestion: b5d52fe

Just use a single semaphore for all queue ordering.

cwfitzgerald · 2024-05-17T21:42:07Z

@jimblandy I think we can go for it. The only question is whether we still want to avoid the anv deadlock mentioned in the old code, or whether we consider it old enough to ignore.

/// It would be correct to use a single semaphore there, but 
/// [Intel hangs in `anv_queue_finish`](https://gitlab.freedesktop.org/mesa/mesa/-/issues/5508).

I should have preserved that comment somehow.

jimblandy · 2024-05-18T04:13:42Z

@cwfitzgerald:

I should have preserved that comment somehow.

Ahh, yes, I was reviewing by checking out the branch and just wandering around with rust-analyzer, so I didn't notice that the patch removed the comment. If that's the sole rationale for the entire semaphore-swapping doohickey then we had definitely better have the comment.

Kvark filed that bug in Oct 2021. That's not that long ago, so I think we'd better keep it. It sounds like a bear to debug, and I'd rather not have to debug it again.

jimblandy · 2024-05-18T04:14:28Z

So I guess the remaining question is what you think of this one.

jimblandy · 2024-05-18T05:25:06Z

So here's the latest version of my suggestion:
jimblandy/wgpu@hal-vk-wait-semaphore-is-option~2...jimblandy:wgpu:hal-vk-wait-semaphore-is-option

cwfitzgerald · 2024-05-18T21:14:39Z

Looks fine to me, could you either PR to my fork or just push to it?

jimblandy · 2024-05-20T03:50:15Z

I still want to get a handle on the swapchain semaphores. I've got some work-in-progress docs for those too. I'll take care of this tomorrow morning.

wgpu-hal/src/vulkan/mod.rs

jimblandy · 2024-05-21T00:51:32Z

wgpu-hal/src/vulkan/mod.rs

-        for &surface_texture in surface_textures {
-            wait_stage_masks.push(vk::PipelineStageFlags::TOP_OF_PIPE);
-            wait_semaphores.push(surface_texture.wait_semaphore);
+        // We lock access to all of the semaphores. This may block if two submissions are in flight at the same time.


Please check my understanding, but I think this can't actually block. All submissions using any given SurfaceTexture are required to use the same Fence, and as explained in the docs for wgpu_hal::Queue::submit, all submissions using the same Fence must be synchronized to occur in a particular order.

jimblandy · 2024-05-21T00:55:08Z

wgpu-hal/src/vulkan/instance.rs

+        // Wait for the previously acquired image used by the semaphore to be available.
+        swapchain.device.wait_for_fence(
+            fence,
+            locked_swapchain_semaphores.previously_used_submission_index,
+            timeout_ns,
+        )?;


What does this actually accomplish? Aren't we going to wait for the acquire semaphore anyway?

cwfitzgerald · 2024-05-21

We must wait until the CPU timeline is sure that the semaphore has been signaled and reset. Otherwise there will be two acquire operations that could signal the same semaphore.

jimblandy · 2024-05-21

So, in other words:

If we were to remove this wait, then acquisition, command submission, and presentation would all set stuff in motion in the presentation engine or on the GPU, but as far as the CPU is concerned, they'd all return without blocking. So the CPU could just run the process full speed ahead, blow through the entire swapchain, come around to the front of the ring buffer again, and make a fresh call to `vkAcquireNextImageKHR` passing the same semaphore as last time. This wait is the only thing that rate-limits that process to match actual presentation speed.

Is that correct?

I've pushed another commit documenting this. If my explanation is correct, please mark this as resolved.
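
The scenario above can be illustrated with a toy model. This is only a sketch under loud assumptions: `ImageSemaphores` and `Sim` are hypothetical types, not the wgpu-hal ones, no Vulkan calls are made, and the "GPU" finishes work only when the CPU waits on the fence, which exaggerates a slow presentation engine.

```rust
/// Hypothetical per-image state, loosely modeled on the field discussed above.
#[derive(Default)]
struct ImageSemaphores {
    /// True while the acquire semaphore has an operation the GPU has not finished.
    acquire_pending: bool,
    /// Index of the last submission that used the acquire semaphore.
    previously_used_submission_index: u64,
}

/// Toy swapchain plus a fence that counts completed submissions.
struct Sim {
    images: Vec<ImageSemaphores>,
    next: usize,
    last_submitted: u64,
    completed: u64,
}

impl Sim {
    fn new(image_count: usize) -> Self {
        Sim {
            images: (0..image_count).map(|_| ImageSemaphores::default()).collect(),
            next: 0,
            last_submitted: 0,
            completed: 0,
        }
    }

    /// Block until submission `index` is done. In this model the GPU only
    /// makes progress when the CPU waits here.
    fn wait_for_fence(&mut self, index: u64) {
        if self.completed < index {
            self.completed = index;
        }
        let completed = self.completed;
        for image in &mut self.images {
            if image.previously_used_submission_index <= completed {
                image.acquire_pending = false;
            }
        }
    }

    /// Acquire the next image, submit work that uses it, and move on.
    fn frame(&mut self, wait_first: bool) -> Result<(), &'static str> {
        let slot = self.next;
        if wait_first {
            let index = self.images[slot].previously_used_submission_index;
            self.wait_for_fence(index);
        }
        if self.images[slot].acquire_pending {
            // Corresponds to reusing a semaphore that still has pending operations.
            return Err("acquire semaphore still has pending operations");
        }
        self.images[slot].acquire_pending = true;
        self.last_submitted += 1;
        self.images[slot].previously_used_submission_index = self.last_submitted;
        self.next = (slot + 1) % self.images.len();
        Ok(())
    }
}
```

In this model, calling `frame(true)` repeatedly on a three-image swapchain always succeeds, while `frame(false)` fails on the fourth call, when the CPU wraps around to an image whose semaphore is still in use.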

jimblandy · 2024-05-21T01:06:14Z

@cwfitzgerald I just pushed two doc commits. Could you look them over for accuracy?

jimblandy · 2024-05-21T01:35:30Z

wgpu-hal/src/vulkan/mod.rs

+        // debug_assert_eq!(
+        //     Arc::as_ptr(&texture.surface_semaphores),
+        //     Arc::as_ptr(&ssc.surface_semaphores[ssc.next_semaphore_index]),
+        //     "Trying to use a surface texture that does not belong to the current swapchain."
+        // );


This should either be uncommented or removed, but we shouldn't leave it in limbo like this.

jimblandy · 2024-05-23T22:11:08Z

rebased on current trunk

cwfitzgerald requested a review from a team as a code owner May 8, 2024 22:55

cwfitzgerald added the "PR: needs back-porting" label (PR with a fix that needs to land on crates) May 9, 2024

cwfitzgerald force-pushed the vk/fix-presentation-synchronization branch from ea6e6ba to 329513b May 11, 2024 04:25

Wumpf mentioned this pull request May 12, 2024

Submitting multiple command encoders per frame causes hang on Vulkan #5693

Open

Wumpf linked an issue May 12, 2024 that may be closed by this pull request

Submitting multiple command encoders per frame causes hang on Vulkan #5693

Open

fredizzimo mentioned this pull request May 12, 2024

Remove rust-gpu neovide/vide#13

Merged

jimblandy reviewed May 13, 2024

wgpu-hal/src/lib.rs (outdated, resolved)

jimblandy reviewed May 13, 2024

wgpu-hal/src/vulkan/mod.rs (outdated, resolved)

EriKWDev mentioned this pull request May 14, 2024

Cannot copy from depth texture to depth texture array even if entire texture is copied #5699

Closed

Wumpf reviewed May 15, 2024

wgpu-hal/src/lib.rs (outdated, resolved)

jimblandy reviewed May 17, 2024

wgpu-hal/src/vulkan/mod.rs (outdated, resolved)

Wumpf mentioned this pull request May 18, 2024

Update wgpu to 0.20 rerun-io/rerun#6171

Draft

19 tasks

jimblandy mentioned this pull request May 20, 2024

wgpu_hal::Surface::discard_texture broken on Vulkan #5723

Open

jimblandy reviewed May 20, 2024

wgpu-hal/src/vulkan/mod.rs (outdated, resolved)

jimblandy reviewed May 21, 2024

jimblandy force-pushed the vk/fix-presentation-synchronization branch from 6d2daa8 to c530f01 May 21, 2024 01:05

jimblandy reviewed May 21, 2024

jimblandy force-pushed the vk/fix-presentation-synchronization branch from a26aba1 to a5a000c May 23, 2024 22:10

cwfitzgerald and others added 14 commits May 23, 2024 15:50

[hal/vk] Rework Submission and Surface Synchronization

5c339a0

Destory

1bb0524

Comments

dca88c4

Nvidia

c024029

Novideo

a79f8b1

Nits

3a71616

Fix raw-gles

d982907

[hal vulkan] Introduce DeviceShared::new_binary_semaphore.

48e6852

Introduce a utility function for making binary semaphores, and use it where appropriate.

[hal vulkan] Turn RelaySemaphores::wait into an option.

3eeb8c7

This allows us to delete `should_wait`, and makes `RelaySemaphoreState` the same as `RelaySemaphores`, so we can just delete the former, and have `RelaySemaphores::advance` return a clone of `self`'s current state.

[hal] Document Surface methods and some presentation constraints.

addf73a

[hal vulkan] Document SwapchainSemaphores.

62efca3

[hal vulkan] Rename SwapchainSemaphores -> SwapchainImageSemaphores

faa56a2

Doc SwapchainImageSemaphores::previously_used_submission_index.

ab509fb

Incorporate Wumpf's suggestions.

4cf108c

jimblandy force-pushed the vk/fix-presentation-synchronization branch from a5a000c to 4cf108c May 23, 2024 22:53
