Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EGL_KHR_partial_update for FBOs #547

Open
emersion opened this issue Oct 27, 2022 · 23 comments
Open

EGL_KHR_partial_update for FBOs #547

emersion opened this issue Oct 27, 2022 · 23 comments

Comments

@emersion
Copy link
Contributor

emersion commented Oct 27, 2022

EGL_KHR_partial_update allows tilers to more efficiently re-use a surface buffer as a render target. It assumes EGLSurface is used, and doesn't support FBOs.

It would be nice to have an equivalent extension for FBOs.

@stonesthrow
Copy link
Contributor

To be clear, you want to specify a rectangle(s) for renderbuffers or textures that are render targets, like a scissor or similar to QCOM's: https://registry.khronos.org/OpenGL/extensions/QCOM/QCOM_tiled_rendering.txt ? And a tiler would limit its fetch and write back to those rectangles? saving some bandwidth - being the benefit here?

@emersion
Copy link
Contributor Author

Yeah, I think that's an accurate description, cc @fooishbar.

@stonesthrow
Copy link
Contributor

@pdaniell-nv , Piers, lets put this on the GLEs/EGL agenda for discussion. thx

@pdaniell-nv
Copy link
Contributor

We discussed this briefly and we're not sure there is a strong use case for FBOs. For EGL there can be more than one context accessing the surface, so the window-level extension makes sense. But for FBOs, which are within the scope of a GL context, it seems all the information needed is already available, and there are no external factors to keep track of.

@emersion
Copy link
Contributor Author

I'm using FBOs in a Wayland compositor. The FBOs are bound to an EGLImageKHR which comes from a DMA-BUF, shared with Wayland client or the kernel.

@stonesthrow
Copy link
Contributor

Not my expertise, but I would expect that the binning process would exclude any area that doesn't have a draw, so no tiles would be fetched or written for that area. So it should be automatic- no?

@cubanismo
Copy link
Contributor

EGL_KHR_partial_update was for communicating a damage region, or scissor effectively, between an EGL surface and the thing where that surface was presented. E.g., between a wayland client and a wayland compositor, or a wayland compositor and a DRM-KMS display that supported self-refresh and/or partial shadow framebuffer updates of some sort. If it helped some tiler logic, I'm pretty sure that was by accident, and I'm curious how it helped. Regardless, that doesn't sound anything like a tiler spilling optimization thing described above, which as noted, should be local and shouldn't care whether an EGLImage wrapping a dma-buf or a simple texture is being used, and should be accomplishable with existing GL mechanisms.

Are you instead trying to avoid a tiler flush/writeback between the client rendering and the compositor texturing or something? That seems like it would be way more complicated than a partial update spec unless the hardware does something like that automatically internally at the HW or firmware level, in which case it seems like we'd need more info on exactly what the tiler needs to accomplish that beyond what it has now.

Now I'm just guessing at what's going on though. Can you be more specific about what it is that goes wrong when you switch from using an EGLSurface to an FBO in the wayland compositor code?

@fooishbar
Copy link

E.g., between a wayland client and a wayland compositor, or a wayland compositor and a DRM-KMS display that supported self-refresh and/or partial shadow framebuffer updates of some sort.

You're thinking of EGL_EXT_swap_buffers_with_damage.

If it helped some tiler logic, I'm pretty sure that was by accident, and I'm curious how it helped.

That was absolutely the intent behind and the cause for EGL_KHR_partial_update existing, yeah. It enables some optimisations to be done at the draw-call level - particularly avoiding unnecessary tile reload/flush which is always expensive. swap_buffers_with_damage goes all the way to the winsys, but partial_update pretty much stays local to the client.

@stonesthrow
Copy link
Contributor

I was concerned from beginning that details for partial_update, swap_with_damage, buffer age, and window system particulars would confuse the specific use case here. Hence the need to clarify.

@cubanismo
Copy link
Contributor

You're thinking of EGL_EXT_swap_buffers_with_damage.

You're right. Now I remember constantly confusing the two even when we were working on them as well. I'm still not clear what additional API is being requested though.

@emersion
Copy link
Contributor Author

emersion commented Dec 5, 2022

I'm still not clear what additional API is being requested though.

One can port code from EGLSurface to FBOs, however one looses support for EGL_KHR_partial_update (regressing perf on tilers) in the process. This is what happened in wlroots.

What I'm requesting is an equivalent of EGL_KHR_partial_update designed for FBOs instead of EGLSurface.

@stonesthrow
Copy link
Contributor

Your render target has the contents of the previous draw, so you know it preserved, and you don't need to account for 'n' previous frames.
But as I said, I think the binning pass of a tiler will limit tile fetch and write back to just the tiles that have a draw update.
Do you have some indication that areas without update are being fetched? That's a hard one to detect I would think.

@emersion
Copy link
Contributor Author

emersion commented Dec 5, 2022

Your render target has the contents of the previous draw, so you know it preserved, and you don't need to account for 'n' previous frames.

No, my render target has the contents of the n-th previous draw. I have my own swapchain implementation with triple buffering.

I'm unsure how to check whether the tiler is doing some useless fetches or not.

@stonesthrow
Copy link
Contributor

I see. You are cycling through n-textures like a swapchain. OK.
Only if a profiler shows you the actual tiles that have a draw, would you be able to detect. I am not sure that is available.
I would need to check with our guys about empty bin skipping and profiler information.
Maybe other tile implementations could chime in, because I'm fairly sure about the binning process and that everyone does it.

@emersion
Copy link
Contributor Author

emersion commented Dec 5, 2022

Okay, that's helpful!

I'd definitely be happy to drop this issue if Mesa drivers already Do The Right Thing™. I'm not too knowledgeable about tilers, only happened to notice the discrepancy between EGLSurface and FBOs.

@cubanismo
Copy link
Contributor

What I'm requesting is an equivalent of EGL_KHR_partial_update designed for FBOs instead of EGLSurface.

Right, but what do you propose as an API for that? I'm unclear what information conveyance is missing. As Jeff says, there are other APIs (I'm not clear why glScissor() is insufficient, but if it is, I would assume the QCOM ext is sufficient?) to convey the concept of a partial framebuffer update. The EGL partial update extension seems mostly useful to plumb some data in before OpenGL commands trigger a readback, but when you're by definition using only GL as a client API, it doesn't seem like there needs to be any winsys-level API involved, since it doesn't have any special surface-level knowledge to add.

I could still be wrong. I've always had trouble with the partial update concept, so I'm just trying to understand what is missing at the API level Vs. implementation level. Hopefully Jeff is right and all is already well. Are you seeing perf or power degradation after switching up the code in wlroots, or just assuming things are suboptimal?

@emersion
Copy link
Contributor Author

emersion commented Dec 5, 2022

Hm, my understanding was that indeed drivers need the information ahead-of-time, and that glScissors (or extensions to batch these) aren't enough. wlroots does things like this:

glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glScissor(0, 0, 42, 42);
// draw…
glScissor(100, 100, 42, 42);
// draw again…

The second scissor here comes "too late" AFAIU.

The API could look like this:

void glSetFramebufferDamageRegion(GLenum target, EGLint *rects, EGLint n_rects);

called right after glBindFramebuffer for instance.

Are you seeing perf or power degradation after switching up the code in wlroots, or just assuming things are suboptimal?

It's tricky to tell, because wlroots has switched to FBOs a long time ago. I will try to compare wlroots and Weston, but not sure the results will be meaningful.

@cubanismo
Copy link
Contributor

cubanismo commented Dec 5, 2022

The second scissor here comes "too late" AFAIU.

OK, I can see how a multi-region command might allow larger batches, but I would have assumed with two scissors, you just get two batches of rendering corresponding to each scissored set of draws. I'm curious whether the tilers actually do more with multiple regions, or whether they just union them into a rect spanning both.

Regardless, it does seem worth validating that there is something to gain here before adding API. Either by inspection of driver code/GPU specs by the relevant IHVs, or via perf measurements.

@stonesthrow
Copy link
Contributor

" I'm curious whether the tilers actually do more with multiple regions, or whether they just union them into a rect spanning both."
When implementing partial_update in the past, I just unioned the areas.

@stonesthrow
Copy link
Contributor

Imagination Tech and Qualcomm also think that bins that have no draw will not be fetched or drawn, but that can be invalidated with a clear. If RenderDoc or something can indicate the bins - unknown, its a very internal optimization feature.

@zzag
Copy link

zzag commented Dec 7, 2022

Regardless, it does seem worth validating that there is something to gain here before adding API. Either by inspection of driver code/GPU specs by the relevant IHVs, or via perf measurements.

What can one use to benchmark this kind of thing? In kwin, I believe that we merged partial update support while taking "it improves performance" for granted.

@janharaldfredriksen-arm
Copy link
Contributor

Imagination Tech and Qualcomm also think that bins that have no draw will not be fetched or drawn, but that can be invalidated with a clear.

Same for Arm (except for some older GPUs). We could still get some benefit from partial updates, but probably a micro-optimization at this point.

What can one use to benchmark this kind of thing?

Easiest if you could get vendor specific performance counters, I expect. In extreme cases, say updating a single pixel in a large resolution frame, you may see performance impact if an implementation fetched the full frame vs only the modified bin/area.

@emersion
Copy link
Contributor Author

emersion commented Jun 8, 2023

@jadahl has pointed me to a Mutter merge request from Erico Nunes:

I checked the impact of this with the lima driver using the hardware performance counters for number of memory reads and writes done by the GPU.
There is a drop in those numbers, sometimes significant, but it seems that with mutter only a few frames benefit from this.
It is not clear to me yet why other frames are not affected. If there are any hints, please let me know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants
@cubanismo @emersion @fooishbar @janharaldfredriksen-arm @pdaniell-nv @stonesthrow @zzag and others