Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster swizzle code #1608

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

NZJenkins
Copy link

@NZJenkins NZJenkins commented Mar 19, 2024

This speeds up swizzling/unswizzling textures.

fill_pattern showed up in profiling Fable: The Lost Chapters.

It seems to implement the equivalent of the pdep instruction.
It turns out there is a slightly faster method to do this in a fixed number of steps.
This increases fps a bit in certain scenes in Fable TLC.

Also use native pdep if available.

Instead of calculating these offsets from masks, we can increment through the offsets.
See https://fgiesen.wordpress.com/2011/01/17/texture-tiling-and-swizzling/

I wrote a small testbench for different methods to ensure it's working the same way (not integrated into xemu)
It should be around 30X faster and is matching both swizzle/unswizzle behaviour of the original code.

width: 256 (67MB)
height: 256
depth: 256
bpp: 4
iterations: 10
...
simple_avg_ms: 1284.55 (old code)
...
fabien_avg_ms: 36.2146  (new code)

Gives a few more FPS in certain situations in Fable TLC:

Fable TLC FPS comparison

Before:
master

After:
quick_swizzle

@NZJenkins
Copy link
Author

run time in seconds for each implementation (using MSVC):

int main() {
    int width = 2048;
    int height = 2048;
    int bpp = 1;
    int pitch = bpp * width;
    uint8_t* garbage_in = (uint8_t*)malloc(width * height * bpp);
    uint8_t* garbage_out = (uint8_t*)malloc(width * height * bpp);

    for (int i = 0; i < 1000; i++) {
        swizzle_rect(garbage_in, width, height, garbage_out, pitch, bpp);
        unswizzle_rect(garbage_out, width, height, garbage_in, pitch, bpp);
    }
}
bench_fill_pattern.exe
182.799363
bench_expand.exe
36.3102008
bench_pdep.exe
6.3883577

Copy link
Member

@mborgerson mborgerson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a big improvement in this swizzle logic performance, nice! In addition to the minor feedback on code, a couple things:

  • Similar to the benchmark, please also test to confirm that the swizzle/unswizzle results are the same in all cases: current impl, new faster sw impl, and new instruction impl
  • Please determine how well this instruction will be supported on user systems. Find out which generation of x86 processors includes this instruction, and if it is relatively commonplace now, confirm that our release builds are actually building for it (related: Enable additional architectural optimizations in release #1492)

mask = (mask ^ mv) | (mv >> (1 << i)); // Compress m.
mk = mk & ~mp;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: space

generate_swizzle_masks(width, height, depth, &mask_x, &mask_y, &mask_z);
expand_mask mask_x, mask_y, mask_z;
generate_swizzle_masks(width, height, depth, &mask_x.mask, &mask_y.mask, &mask_z.mask);
generate_expand_mask_moves(&mask_x);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calculation unnecessary in case of pdep instruction?

* expand(0000abcd, 10011010) = a00bc0d0.
*
* Implementation from Hacker's delight chapter 7 "Expand"
* https://stackoverflow.com/questions/77834169/what-is-a-fast-fallback-algorithm-which-emulates-pdep-and-pext-in-software
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Citing the book should be enough, don't need the stackoverflow.com link

}

static uint32_t expand(uint32_t x, expand_mask* expand_mask) {
#ifdef __BMI2__
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Readability would be better with these macros unindented

uint32_t moves[5];
} expand_mask;

static void generate_expand_mask_moves(expand_mask* expand_mask) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Asterisk goes next to identifier

based on https://fgiesen.wordpress.com/2011/01/17/texture-tiling-and-swizzling/
- Increment swizzled offsets, instead of calculating them from scratch for each pixel.
This seems to be around 30X faster.
- Add/update comments, including high-level comment on what swizzle.c is for.

This improves FPS in certain areas in Fable: The Lost Chapters
@NZJenkins NZJenkins changed the title Slightly faster swizzle code Faster swizzle code Mar 29, 2024
@NZJenkins
Copy link
Author

I've changed the implementation to get the same performance without BMI2, and have updated the PR description

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants