Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WebGLRenderer: Tight morph target packing. #27768

Open
wants to merge 5 commits into
base: dev
Choose a base branch
from

Conversation

wizgrav
Copy link
Contributor

@wizgrav wizgrav commented Feb 19, 2024

While optimizing morph target performance for this demo I noticed that positions and normals carry an unused component.

This PR changes the morph target texture from rgba -> single float to enable fetching only the xyz components for positions and normals while for color it still stores and fetches the full rgba set.

It is functionally identical as before but gives 25% better fps (23 vs 18) on my laptops 3060 in this example which I also include in the PR. If you don't want it I can take it out but it would be nice to have something to stress test morph targets.

Copy link

github-actions bot commented Feb 19, 2024

📦 Bundle size

Full ESM build, minified and gzipped.

Filesize dev Filesize PR Diff
676.4 kB (168.3 kB) 676.8 kB (168.3 kB) +374 B

🌳 Bundle size after tree-shaking

Minimal build including a renderer, camera, empty scene, and dependencies.

Filesize dev Filesize PR Diff
457.3 kB (111 kB) 457.7 kB (111.1 kB) +376 B

@Mugen87
Copy link
Collaborator

Mugen87 commented Feb 19, 2024

It seems the webgl_loader_mmd example fails with this PR. Do you mind verifying?

@Mugen87
Copy link
Collaborator

Mugen87 commented Feb 19, 2024

One example that demonstrates instanced morph targets is sufficient. Do you mind modifying webgl_instancing_morph instead of adding a new demo?

int y = texelIndex / morphTargetsTextureSize.x;
int x = texelIndex - y * morphTargetsTextureSize.x;

ivec3 morphUV = ivec3( x, y, morphTargetIndex );
return texelFetch( morphTargetsTexture, morphUV, 0 );
Copy link
Collaborator

@Mugen87 Mugen87 Feb 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are multiple fetches really faster than a single one? I'm afraid this needs to be tested one more than one device. Especially mobile devices could show a performance regression here.

Copy link
Contributor Author

@wizgrav wizgrav Feb 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As it is currently with rgba format it will always fetch 4 floats so fetching 3 instead should always be faster especially on mobiles where bandwidth is precious. I'll replace the horse example with the faces one and do some more testing but I only have one android phone with an older snapdragon and the quest 3 which all show similar improvements

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, you could also add a slider to webgl_instancing_morph that controls the number of instances.

Copy link
Contributor Author

@wizgrav wizgrav Feb 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to have the faces instead because they are using more influences and also morph the normals so its a more complete case than the horses.

@wizgrav
Copy link
Contributor Author

wizgrav commented Feb 19, 2024

It seems the webgl_loader_mmd example fails with this PR. Do you mind verifying?

I saw that it's failing when checking the screenshot with "Diff wrong in 0.4% of pixels in file: webgl_loader_mmd" but this is irrelevant to any of the files in this PR. I've seen it happening in the past and usually triggering the tests again with a commit fixes it but I've done it once and it didn't. I've checked the files on my repo and they look identical on the eye, maybe its a transparency issue? I'm a bit lost on this one, any help would be appreciated

@@ -20,12 +20,34 @@ export default /* glsl */`

vec4 getMorph( const in int vertexIndex, const in int morphTargetIndex, const in int offset ) {

int texelIndex = vertexIndex * MORPHTARGETS_TEXTURE_STRIDE + offset;
int texelIndex = vertexIndex * MORPHTARGETS_TEXTURE_STRIDE + 3 * offset;
Copy link
Collaborator

@Mugen87 Mugen87 Feb 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to multiply the offset with 3 now?

Copy link
Contributor Author

@wizgrav wizgrav Feb 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To keep in line with what we previously had on the shader chunks that call getMorph. Now the position is at offset 0, normals at offset 3 and colors at offset 6. Alternatively we can modify the other shader chunks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid any misunderstanding, you want me to change the other shader chunks to eliminate the multiplication with 3? At some point we will probably have to revisit that part anyway if/when we add additional morphing attributes like uv. When that time comes more extensive changes on how we handle morphing will be needed because now it's a bit fixed like if there's a color morph but not normals the latter will be empty space in the texture. This is how it used to be even before this PR but we can make it more flexible when more attributes are introduced

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep the current version of this PR 👍 .

@Mugen87 Mugen87 changed the title Tight morph target packing WebGLRenderer: Tight morph target packing. Feb 19, 2024
@Mugen87
Copy link
Collaborator

Mugen87 commented Feb 19, 2024

I've checked the files on my repo and they look identical on the eye, maybe its a transparency issue? I'm a bit lost on this one, any help would be appreciated

I've downloaded your branch and tested locally. I can clearly see an artifact at Miku's skirt. You can also see this when going to webgl_loader_mmd_pose and then set the 笑い morph target influence to 1. Notice the black triangle on the left bottom side of the skirt.

image

@wizgrav
Copy link
Contributor Author

wizgrav commented Feb 19, 2024

@Mugen87 that artifact was caused by an incorrect calculation of the texture width when it exceeded the hardware maximum. Its fixed now, I did quite a bit of testing and it seems solid. I've also added some sliders on the example to tweak the instance count and set it up to start conservative so its much more useful now for testing

Mugen87
Mugen87 previously approved these changes Feb 20, 2024
Copy link
Collaborator

@Mugen87 Mugen87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation looks good now! The only issue is that performance testing isn't really possible since there is no example that allows to compare the existing and new approach. Especially since the reference example has been updated with this PR. Given the advantage in bandwidth cost I still think it's worth to given this new approach a try.

@Mugen87 Mugen87 added this to the r162 milestone Feb 20, 2024
@wizgrav
Copy link
Contributor Author

wizgrav commented Feb 20, 2024

@Mugen87 I've set up the example with the current dev to compare

Current Dev - RGBA format - Fetching 4 floats

This PR - Red format- Fetching 3 floats

@Mugen87
Copy link
Collaborator

Mugen87 commented Feb 20, 2024

On a Pixel 4a with latest Chrome I see this with the default grid settings:

Current Dev: 23 FPS
This PR: 20 FPS

On an iPad (8.Gen) with Safari with max grid settings:

Current Dev: 24 FPS
This PR: 23 FPS

So I can't confirm based on both devices that the new implementation is in general faster. Fetching a single RGBA texel seems to be faster than fetching 3 Red texels on the above devices.

@Mugen87 Mugen87 dismissed their stale review February 20, 2024 10:20

Performance regressions on mobile devices.

@Mugen87
Copy link
Collaborator

Mugen87 commented Feb 20, 2024

Sry, I have to revert my approval. When can't merge the PR if we see performance degradations. This needs more testing especially since most devices outside there are mobile.

@wizgrav
Copy link
Contributor Author

wizgrav commented Feb 20, 2024

Sry, I have to revert my approval. When can't merge the PR if we see performance degradations. This needs more testing especially since most devices outside there are mobile.

No worries that's why I set up the examples. At least on my phone with an adreno 640 I see consistent improvement on all grid size eg on 3 x 3 grid it's 34 vs 31 fps

I believe you run into thermal throttling issues but yeah we can keep it in testing for a bit. If you want we can at least include this example instead of the horse one as its way more useful for testing

@Mugen87
Copy link
Collaborator

Mugen87 commented Feb 20, 2024

If you want we can at least include this example instead of the horse one as its way more useful for testing

That would be great!

@mrdoob
Copy link
Owner

mrdoob commented Feb 20, 2024

Pixel 8 Pro

dev pr
Screenshot_20240220-212427 Screenshot_20240220-212513

@wizgrav
Copy link
Contributor Author

wizgrav commented Feb 20, 2024

@Mugen87 @mrdoob yeah your results seem to be consistent with what I get as well but I made a shocking discovery. I tried the examples on a Quest 3 where current dev gives 40 fps while this PR gives almost 90 with the default grid.

This can't be attributed to bandwidth savings alone there seems to be an issue with fetching rgba floats there and if that's the case it probably affects skinned meshes as well as the bone matrices are also stored as rgba. @cabanier can you verify? I will investigate further

@wizgrav
Copy link
Contributor Author

wizgrav commented Feb 20, 2024

@Mugen87 I did a lot of testing including stress testing skinned meshes and it seems that the quest exhibits this behaviour only when used with instanced morph targets with the current dev. The only thing that comes to my mind is that because we are fetching the weights from a single channel texture and then the morph attributes from rgba ones, this triggers some issue in that platform/driver. With this PR where both textures are single channel it seems that it doesn't

In any case, on all my devices I can only see improvement and I think @mrdoob also has a similar experience from what he posted. How should I handle it? Should I clean this PR and leave only the example, set up the cases live and open an issue for people to test. Or should I leave this PR open and somehow invite people to share here? Please advise

@Mugen87
Copy link
Collaborator

Mugen87 commented Feb 20, 2024

Or should I leave this PR open and somehow invite people to share here?

Let's do this. It would be good if a few more people could test both links (ideally with mobile devices) and report performance data. If the majority of devices show an improvement, I'm okay with approving the PR.

@CodyJasonBennett
Copy link
Contributor

CodyJasonBennett commented Feb 20, 2024

Tested on Pixel 3a and iPhone 13. Can benchmark on Quest 2-3 later.

Show results:

Pixel 3a, Chrome 121, Android 12

Dev PR
dev PR

iPhone 13, iOS 17.3.1

Dev PR
dev PR

@cabanier
Copy link
Contributor

I tried the examples on a Quest 3 where current dev gives 40 fps while this PR gives almost 90 with the default grid.

That is quite the improvement!

This can't be attributed to bandwidth savings alone there seems to be an issue with fetching rgba floats there and if that's the case it probably affects skinned meshes as well as the bone matrices are also stored as rgba.

I'm out of town this week with no access to a device. Can you describe the change you made? I can forward to our team that works on the low level graphics stack to get their opinion.

@wizgrav
Copy link
Contributor Author

wizgrav commented Feb 21, 2024

@cabanier

The context is that we are instancing meshes that do morph target animations. Each instance has its own set of weights for the morph targets.

In the current dev we fetch some floats from a single channel texture and fill an array with the weights for the morph targets, then we iterate that array and if an element is > 0 we fetch one or more rgba texels from an array texture to get the attributes for the morphing(position/normal/color). shaderchunk

In this PR we still fetch the weights as before but we convert the attribute array texture to single channel as well and fetch the xyz components for position/normal individually to save bandwidth because in the rgba case we have one component which is always set to zero. shaderchunk

I tested the examples on the Quest 3 and the speedup(>100%) is way more than what we could theoretically achieve which leads me to believe that the setup of the current dev causes some issue with the stack/shader compilation which degrades performance

You can use the links in this comment for reference #27768 (comment)

light.castShadow = true;
const renderer = new THREE.WebGLRenderer( {
antialias: true,
powerPreference: 'high-performance'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to set this to 'high-performance'?

@cabanier
Copy link
Contributor

We're still looking into why this speedup might happen but someone noticed that you added high-performance to the WebGLRenderer.
Is it still faster if you disable that feature?

@wizgrav
Copy link
Contributor Author

wizgrav commented Feb 23, 2024

We're still looking into why this speedup might happen but someone noticed that you added high-performance to the WebGLRenderer. Is it still faster if you disable that feature?

I tried removing it on both the current dev and this PR, I even set it to "low-power" but it didn't make a difference. From my understanding this hint applies to dual gpu setups and instructs the browser to select the better one for the webgl context. Like eg in laptops where you have an intel alongside an nvidia. You use it in your stack to enable optimizations?

From my limited knowledge on graphic stacks, I suspect that the slowdown in the current dev happens because different texture formats are used and the stack has to do some reconfiguration on the fly to switch the indexing from one format to the other and this is expensive. But again my knowledge is limited.

@cabanier
Copy link
Contributor

From my limited knowledge on graphic stacks, I suspect that the slowdown in the current dev happens because different texture formats are used and the stack has to do some reconfiguration on the fly to switch the indexing from one format to the other and this is expensive. But again my knowledge is limited.

FYI, we're still looking into why this is making such a huge difference. We are suspecting that this could have a more efficient caching strategy but haven't proved it yet.

@RenaudRohlinger
Copy link
Collaborator

FYI I implemented this PR in the WebGPURenderer and did run some test in both WebGPU and WebGL Backends and using Tight packing via Red Channel seems to increase by 5 to 10% the GPU usage.

#27984

@wizgrav
Copy link
Contributor Author

wizgrav commented Mar 24, 2024

FYI I implemented this PR in the WebGPURenderer and did run some test in both WebGPU and WebGL Backends and using Tight packing via Red Channel seems to increase by 5 to 10% the GPU usage.

#27984

This is for apple silicon from what I can see in your PR, there are probably alignment issues there but other architectures behave differently. On intel/nvidia/amd and some mobile chipsets we see a good speedup. On the Q3 specifically, and maybe snapdragons in general, we see a major regression when we mix single and multi channel float textures, as the current dev does, in the vertex shader. I'm still waiting for some insight by meta and @rcabanier on this

@mrdoob mrdoob modified the milestones: r163, r164 Mar 29, 2024
@mrdoob mrdoob modified the milestones: r164, r165 Apr 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants