Gather/Scatter can be avoided if Source and Destination are of same type. #2778

Open
MkazemAkhgary opened this issue Feb 28, 2024 · 2 comments
Labels
Bugs Performance All issues related to performance/code generation

Comments

@MkazemAkhgary

MkazemAkhgary commented Feb 28, 2024

Here is a very simplified example:

struct Foo { uint8 a, b, c, d, e, f, g, h; };

unmasked void Test(uniform Foo Dest[], uniform Foo Src[])
{
    // roughly 60 instructions on avx2
    Dest[programIndex] = Src[programIndex];
}

This could be avoided with:

unmasked void Test(uniform Foo Dest[], uniform Foo Src[])
{
    // copy quad words instead.
    ((uniform int64 * uniform)Dest)[programIndex] = ((uniform int64 * uniform)Src)[programIndex];
}

Right now I'm using this optimization, but to be safe I added this check:

if (sizeof(uniform Foo) == 8)
    ((uniform int64 * uniform)Dest)[programIndex] = ((uniform int64 * uniform)Src)[programIndex];
else
    Dest[programIndex] = Src[programIndex];

But I think it would be nice if the compiler could optimize this case itself, and I believe the approach can be adapted to structs of any size.
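As a rough sketch of how the same trick might extend to larger structs: if the struct size is a multiple of 8 bytes and both arrays are suitably aligned for int64 access (both are assumptions here), the programCount structs handled by one gang can be treated as a flat run of 64-bit words and copied with contiguous loads/stores. Foo16 and CopyAsWords are hypothetical names used only for illustration.

```ispc
// Hypothetical 16-byte struct; any size that is a multiple of 8 bytes
// could be handled the same way.
struct Foo16 { uint8 bytes[16]; };

unmasked void CopyAsWords(uniform Foo16 Dest[], uniform Foo16 Src[])
{
    uniform int64 * uniform d = (uniform int64 * uniform)Dest;
    uniform int64 * uniform s = (uniform int64 * uniform)Src;

    // Number of 64-bit words per struct (2 for Foo16).
    uniform int wordsPerStruct = sizeof(uniform Foo16) / sizeof(uniform int64);

    // The gang copies programCount structs, i.e. wordsPerStruct * programCount
    // contiguous words, one vector-wide chunk per iteration.
    for (uniform int w = 0; w < wordsPerStruct; w++)
    {
        d[w * programCount + programIndex] = s[w * programCount + programIndex];
    }
}
```

Note that in this form a lane no longer copies "its own" struct; the gang as a whole copies the same bytes, which is only equivalent because the function is unmasked and the indices are contiguous.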

@dbabokin
Collaborator

So, basically it needs to be a memcpy, and we need to recognize when the source and destination data types match, to avoid unpacking/packing.

@pbrubaker pbrubaker added Performance All issues related to performance/code generation Bugs labels Apr 14, 2024
@MkazemAkhgary
Author

MkazemAkhgary commented Apr 15, 2024

> So, basically it needs to be a memcpy, and we need to recognize when the source and destination data types match, to avoid unpacking/packing.

There is also the scenario where the indices are not contiguous, in which case gather/scatter operations are more efficient with primitive types such as int64. In my test on AVX2, Dest[i] = Src[i]; compiles to 291 instructions, compared to 38 instructions for ((uniform int64 * uniform)Dest)[i] = ((uniform int64 * uniform)Src)[i];.
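The test itself is not shown here, so the following is only a minimal sketch of the two variants being compared, assuming a varying index i passed in as a parameter. The function names are hypothetical, and the instruction counts quoted above are not reproduced by this sketch.

```ispc
struct Foo { uint8 a, b, c, d, e, f, g, h; };

// Variant 1: gather/scatter of the struct itself, member by member.
void CopyIndexedStruct(uniform Foo Dest[], uniform Foo Src[], int i)
{
    Dest[i] = Src[i];
}

// Variant 2: reinterpret each 8-byte struct as one int64, so a single
// 64-bit gather and a single 64-bit scatter move the whole struct.
void CopyIndexedWord(uniform Foo Dest[], uniform Foo Src[], int i)
{
    ((uniform int64 * uniform)Dest)[i] = ((uniform int64 * uniform)Src)[i];
}
```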

A more efficient gather/scatter can also be devised when the indices are not contiguous but the source and destination are of the same type: for each active lane, copy the whole struct from Src to Dest.

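// i is the varying index; copy one whole struct per active lane.
#pragma unroll
for (uniform int32 j = 0; j < programCount; j++)
{
    // Skip lanes that are inactive in the current execution mask.
    if (((lanemask() >> j) & 1) != 0)
    {
        Dest[extract(i, j)] = Src[extract(i, j)];
    }
}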

In the case of masked copies, vmaskmov can be used for 32-bit and 64-bit structs.
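For reference, a masked contiguous copy of 8-byte structs can be written by hand today in roughly this way. This is a sketch only: CopyMasked is a hypothetical helper, and whether the masked int64 load/store actually lowers to vmaskmov depends on the target and compiler version, which I have not verified here.

```ispc
struct Foo { uint8 a, b, c, d, e, f, g, h; };

// Hypothetical helper: copy `count` structs, where count need not be a
// multiple of programCount.
void CopyMasked(uniform Foo Dest[], uniform Foo Src[], uniform int count)
{
    uniform int64 * uniform d = (uniform int64 * uniform)Dest;
    uniform int64 * uniform s = (uniform int64 * uniform)Src;

    // foreach runs the final partial iteration under a mask, so the
    // trailing elements go through a masked 64-bit load and store.
    foreach (k = 0 ... count)
    {
        d[k] = s[k];
    }
}
```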
