Faster swizzle code #1608

NZJenkins · 2024-03-19T09:07:45Z

This speeds up swizzling/unswizzling textures.

fill_pattern showed up in profiling Fable: The Lost Chapters.

It seems to implement the equivalent of the pdep instruction.
It turns out there is a slightly faster method to do this in a fixed number of steps.
This increases fps a bit in certain scenes in Fable TLC.

~~Also use native pdep if available.~~

Instead of calculating these offsets from masks, we can increment through the offsets.
See https://fgiesen.wordpress.com/2011/01/17/texture-tiling-and-swizzling/
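The increment trick from the linked article can be sketched as follows. This is a minimal illustration and not the code in this PR: `deposit_bits`, `next_swizzled`, and `count_mismatches` are hypothetical names, and the masks in the usage note assume plain bit interleaving of x and y.

```c
#include <stdint.h>

/* Reference: place the low bits of v into the set bits of mask, one bit at
 * a time. This is the per-pixel work that incrementing avoids. */
static uint32_t deposit_bits(uint32_t v, uint32_t mask)
{
    uint32_t out = 0;
    for (uint32_t bit = 1; mask != 0; bit <<= 1) {
        uint32_t lowest = mask & (~mask + 1);  /* lowest set bit of mask */
        if (v & bit) {
            out |= lowest;
        }
        mask &= mask - 1;                      /* clear that bit */
    }
    return out;
}

/* Step a swizzled coordinate to its next value. Subtracting the mask lets
 * the carry ripple across the gaps between mask bits; the AND then drops
 * everything outside the mask. Wraps to 0 after the last value. */
static uint32_t next_swizzled(uint32_t offset, uint32_t mask)
{
    return (offset - mask) & mask;
}

/* Walk a width x height surface and compare the incremented offsets with
 * freshly deposited ones. Returns the number of mismatches (0 = identical). */
static unsigned count_mismatches(uint32_t width, uint32_t height,
                                 uint32_t mask_x, uint32_t mask_y)
{
    unsigned mismatches = 0;
    uint32_t off_y = 0;
    for (uint32_t y = 0; y < height; y++) {
        uint32_t off_x = 0;
        for (uint32_t x = 0; x < width; x++) {
            uint32_t expected = deposit_bits(x, mask_x) | deposit_bits(y, mask_y);
            if ((off_x | off_y) != expected) {
                mismatches++;
            }
            off_x = next_swizzled(off_x, mask_x);
        }
        off_y = next_swizzled(off_y, mask_y);
    }
    return mismatches;
}
```

For a 16x16 surface with `mask_x = 0x55` and `mask_y = 0xAA`, `count_mismatches` should return 0, meaning the incremented offsets match the ones computed from scratch for every pixel.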

I wrote a small testbench for the different methods to ensure they behave the same way (not integrated into xemu).
The new code should be around 30X faster, and it matches both the swizzle and unswizzle behaviour of the original code.

width: 256 (67MB)
height: 256
depth: 256
bpp: 4
iterations: 10
...
simple_avg_ms: 1284.55 (old code)
...
fabien_avg_ms: 36.2146  (new code)

Gives a few more FPS in certain situations in Fable TLC:

[Screenshots: Fable TLC FPS comparison, before and after]

NZJenkins · 2024-03-19T10:05:53Z

run time in seconds for each implementation (using MSVC):

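#include <stdint.h>
#include <stdlib.h>

/* swizzle_rect() and unswizzle_rect() come from the implementation under test. */

int main(void) {
    int width = 2048;
    int height = 2048;
    int bpp = 1;
    int pitch = bpp * width;
    /* Buffer contents are deliberately left uninitialized; only run time matters. */
    uint8_t* garbage_in = (uint8_t*)malloc(width * height * bpp);
    uint8_t* garbage_out = (uint8_t*)malloc(width * height * bpp);

    for (int i = 0; i < 1000; i++) {
        swizzle_rect(garbage_in, width, height, garbage_out, pitch, bpp);
        unswizzle_rect(garbage_out, width, height, garbage_in, pitch, bpp);
    }

    free(garbage_in);
    free(garbage_out);
    return 0;
}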

bench_fill_pattern.exe
182.799363
bench_expand.exe
36.3102008
bench_pdep.exe
6.3883577

mborgerson

Looks like a big improvement in the performance of this swizzle logic, nice! In addition to the minor feedback on the code, a couple of things:

- Similar to the benchmark, please also test to confirm that the swizzle/unswizzle results are the same in all cases: current impl, new faster sw impl, and new instruction impl.
- Please determine how well this instruction will be supported on user systems. Find out which generation of x86 processors includes this instruction, and if it is relatively commonplace now, confirm that our release builds are actually building for it (related: Enable additional architectural optimizations in release #1492).
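For context on the support question: pdep is part of BMI2, which Intel introduced with Haswell and AMD with Excavator, and on AMD parts before Zen 3 the instruction is microcoded and slow. Whether a build actually uses it depends on compiler flags. The sketch below shows one way to select the instruction at compile time with a portable fallback; `pdep32`, `pdep32_soft`, and `built_with_bmi2` are hypothetical names, not functions from this PR.

```c
#include <stdint.h>

/* GCC and Clang define __BMI2__ when built with -mbmi2 or a -march that
 * implies it. MSVC does not define this macro, so an MSVC build would need
 * a different check (an assumption worth verifying per toolchain). */
#if defined(__BMI2__)
#include <immintrin.h>
#endif

/* Portable fallback: deposit the low bits of v into the set bits of mask. */
static uint32_t pdep32_soft(uint32_t v, uint32_t mask)
{
    uint32_t out = 0;
    for (uint32_t bit = 1; mask != 0; bit <<= 1) {
        uint32_t lowest = mask & (~mask + 1);  /* lowest set bit of mask */
        if (v & bit) {
            out |= lowest;
        }
        mask &= mask - 1;                      /* clear that bit */
    }
    return out;
}

/* Uses the hardware instruction only if the build enabled BMI2. */
static uint32_t pdep32(uint32_t v, uint32_t mask)
{
#if defined(__BMI2__)
    return _pdep_u32(v, mask);
#else
    return pdep32_soft(v, mask);
#endif
}

/* Reports which path this translation unit was compiled with. */
static int built_with_bmi2(void)
{
#if defined(__BMI2__)
    return 1;
#else
    return 0;
#endif
}
```

Printing `built_with_bmi2()` from a release binary is a quick way to confirm whether the build flags enabled the instruction. A compile-time switch like this produces a binary that will not run on CPUs without BMI2, so a runtime check would be needed if older processors must stay supported.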

mborgerson · 2024-03-20T06:09:56Z

hw/xbox/nv2a/swizzle.c

+        mask = (mask ^ mv) | (mv >> (1 << i)); // Compress m.
+        mk = mk & ~mp;
+    }
+


mborgerson · 2024-03-20T06:16:58Z

hw/xbox/nv2a/swizzle.c

-    generate_swizzle_masks(width, height, depth, &mask_x, &mask_y, &mask_z);
+    expand_mask mask_x, mask_y, mask_z;
+    generate_swizzle_masks(width, height, depth, &mask_x.mask, &mask_y.mask, &mask_z.mask);
+    generate_expand_mask_moves(&mask_x);


Is this calculation unnecessary when the pdep instruction is used?

mborgerson · 2024-03-20T06:17:31Z

hw/xbox/nv2a/swizzle.c

+ * expand(0000abcd, 10011010) = a00bc0d0.
+ *
+ * Implementation from Hacker's delight chapter 7 "Expand"
+ * https://stackoverflow.com/questions/77834169/what-is-a-fast-fallback-algorithm-which-emulates-pdep-and-pext-in-software


Citing the book should be enough; the stackoverflow.com link isn't needed.
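For readers without the book at hand, the "expand" routine can be sketched as below. The names `expand_mask`, `moves`, `generate_expand_mask_moves`, and `expand` follow the diff fragments quoted in this review, but the bodies are reconstructed from the Hacker's Delight algorithm and are an assumption about how the PR splits the work, not a copy of its code.

```c
#include <stdint.h>

typedef struct expand_mask {
    uint32_t mask;      /* destination bit positions */
    uint32_t moves[5];  /* bits to shift at each power-of-two step */
} expand_mask;

/* Precompute, for a fixed mask, which bits move by 1, 2, 4, 8, and 16
 * positions. This depends only on the mask, so it runs once per mask. */
static void generate_expand_mask_moves(expand_mask *em)
{
    uint32_t mask = em->mask;
    uint32_t mk = ~mask << 1;           /* count 0s to the right of each bit */

    for (int i = 0; i < 5; i++) {
        uint32_t mp = mk ^ (mk << 1);   /* parallel suffix */
        mp ^= mp << 2;
        mp ^= mp << 4;
        mp ^= mp << 8;
        mp ^= mp << 16;
        uint32_t mv = mp & mask;        /* bits to move at this step */
        em->moves[i] = mv;
        mask = (mask ^ mv) | (mv >> (1 << i));  /* compress the mask */
        mk &= ~mp;
    }
}

/* Scatter the low bits of x into the set bits of em->mask in five fixed
 * steps, applying the precomputed moves from largest shift to smallest. */
static uint32_t expand(uint32_t x, const expand_mask *em)
{
    for (int i = 4; i >= 0; i--) {
        uint32_t mv = em->moves[i];
        uint32_t t = x << (1 << i);
        x = (x & ~mv) | (t & mv);
    }
    return x & em->mask;                /* clear extraneous bits */
}
```

With `mask = 0x9A` (binary 10011010), calling `generate_expand_mask_moves` once and then `expand(0xA, &em)` maps abcd = 1010 to a00bc0d0 = 10001000, matching the pattern in the comment under review.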

mborgerson · 2024-03-20T06:20:21Z

hw/xbox/nv2a/swizzle.c

+}
+
+static uint32_t expand(uint32_t x, expand_mask* expand_mask) {
+    #ifdef __BMI2__


nit: Readability would be better with these macros unindented

mborgerson · 2024-03-20T06:36:17Z

hw/xbox/nv2a/swizzle.c

+    uint32_t moves[5];
+} expand_mask;
+
+static void generate_expand_mask_moves(expand_mask* expand_mask) {


nit: Asterisk goes next to identifier

based on https://fgiesen.wordpress.com/2011/01/17/texture-tiling-and-swizzling/

- Increment swizzled offsets, instead of calculating them from scratch for each pixel. This seems to be around 30X faster.
- Add/update comments, including high-level comment on what swizzle.c is for.

This improves FPS in certain areas in Fable: The Lost Chapters

NZJenkins · 2024-03-29T10:01:53Z

I've changed the implementation to get the same performance without BMI2, and have updated the PR description.

mborgerson reviewed Mar 20, 2024

NZJenkins force-pushed the quick_swizzle branch from af61911 to 2d8596c Compare March 29, 2024 09:39

NZJenkins changed the title ~~Slightly faster swizzle code~~ Faster swizzle code Mar 29, 2024

NZJenkins requested a review from mborgerson March 29, 2024 09:55
