gpu: add make option GPU_SW_COMPILE_FAST=1 to perform fast compile of Software GPU #794

jstine35 · 2021-05-24T06:34:24Z

This change reverts d8479c7 and replaces it with a make-conditional that skips the offending templates in developer build environments, while still allowing them to be effective for user builds. Some SW renderer performance is lost when this setting is enabled, but build iteration time is much quicker and executable size is reduced. Good option for devs doing rapid iterations, or if a targeted platform needs a smaller executable.

Example:

make -j6 GPU_SW_COMPILE_FAST=1

naming convention

GPU_SW_COMPILE_FAST is tentative: I just picked a name out of a hat, and not sure I like it (maybe a bit long). I'm open to suggestions for a better naming scheme for this makefile argument.

This reverts commit d8479c7.

… Software GPU Some SW renderer performance is lost when this setting is enabled, but build iteration time is much quicker and executable size is reduced. Good option for devs doing rapid iterations, or if a targeted platform needs a smaller executable. Example: make -j6 GPU_SW_COMPILE_FAST=1

ZachCook · 2021-05-24T16:15:36Z

When enabled this forces MaskEvalTA true, needs a runtime check of g->GPU.MaskEvalAND for each function that has a MaskEvalTA template like you did with pgxp_runtime.

When GPU.MaskEvalAND is true command->func[][] at the bottom of ProcessFIFO() will go out of bounds, need to remove that bitwise OR when enabled.

Though should really make sure that it is better for performance to keep these template bools instead of only the runtime checks, I would think due to how much the binary size increases that the additional function template bools would lower the cache hit rate, I couldn't detect any performance difference myself.

jstine35 · 2021-05-25T13:41:27Z

Performance for this code is important for specific areas of specific games (usually menu screens, sadly). The specific case I was looking at recently while doing 2X uprender feature for swanstation was Gekido. It sets MaskEvalAND=0 and then renders several large quads across the whole screen. This caused the rasterizer to sample (read-back) from the framebuffer one very pixel unnecessarily and performance was quite poor as a result. Adding the MaskEval_TA sped things up by around 20% for this specific case.

What we are doing is the equivalent of building out a matrix of GPU shaders (pre-compiled matrix of GPU shaders is the main reason your typical Nvidia driver download is +500mb). Only some shaders are used for any given scene in any given game. The problem is that various game engines will have different preferred ways of doing things, and will use a slightly different set of shaders for the bulk of their work. It becomes necessary to bake in many varieties of shaders to cover the vast library of games on a PSX.

Cachelines absolutely do not matter in this context. Only a few KBs of the 3MB of pre-built shaders baked into the EXE will be used on any given frame. In fact cachelines should almost always benefit from these templates, because the majority of these custom shaders are much smaller than an "ubershader" meant to cover all functionality. For example, making pgxp a runtime option, because if its intrinsic alt-path code bloat, is going to be harder on L caches than compiling multiple versions with and without pgxp. (edit: also this is still not a great example - the main perf hit will be branch prediction, as the majority of pgxp code will be skipped by branch and thus still not cause much L-cache issue. But the branch predictor will suffer because it's also quite limited on most cpu architectures)

To be fair, this is all academic discussion: I doubt cachelines would matter even if some game did use a few dozen shader variants. Each shader is run thousands of times at once for any meaningful draw area coverage. The L-cache warmup would only impact the first iteration. Even if every shader was an L cache miss on the first pixel, we'd only see 0.005% perf hit except if a game had some special particle fx and was rendering each pixel using a different shader (likely impossible on PS1 anyway).

MaskEvalTA does not need to be evaluated at runtime. If it is enabled then it will be gated by the application of gpu->MaskEvalAND. On theory we could test MaskEvalAND prior to sampling the current texel, and thus remove the MaskEval_TA argument entirely (from either template or runtime). This would result in something inbetween:

overall SW performance for all games would be very slightly slower (maybe 1-2%)
edge cases like Gekido menus would be 12-17% faster (vs. 15-20%)

At a glance, I feel like Mednafen is specifying MaskEval_TA unnecessarily for a lot of function that don't use the parameter. This could be negatively impacting build times as well. Maybe the better approach is to remedy that issue. (but I think also some of this might be specific to pgxp, which I'm less familiar with than standard PS1 GPU rendering).

The array out of bounds I'll fix in a min. That was a last second foopah by me. We don't even need to change the array size. I just started thinking I should try to align the change more with the original patch and then tossed it in without testing. >_<

jstine35 · 2021-05-25T14:30:12Z

Pushed fix for reading past the end of array.

Another thing occurred to me - the majority of perf concerns are for Triangles only. Sprites are rarely used and lines as well. Chances are we can reduce template complexity for sprites and lines and get some build speed improvement, without meaningful performance hit since triangles would remain more highly optimized. But we also have to be weary that, I think, some games use orthogonal lines to render sprites/quads in clever ways sometimes, which means lots of lines and large coverage area and, by proxy, performance issues.

(unfortunately I don't know which offhand, only recalling the lamenting of other folks who have worked on these PS1 GPU emulation over the years).

ZachCook · 2021-05-25T19:45:30Z

PlotPixel, PlotNativePixel, DrawSpan, DrawSprite look like they change behavior with MaskEval_TA forced to 1 with these changes.

I definitely agree with reverting my fast compile hack with those runtime performance regressions, it might even be better to keep the longer compile times and larger binary size if this optimization would add too much divergence between user and developer builds, and being disabled by default there is going to be low usage.

I'll take a look at unused template parameters to see if we can get some of the compile speed and size benefits without regressions.

jstine35 · 2021-05-27T15:28:04Z

Ah, I see - the GPU is baking 0x8000 instead of reading gpu->MaskEvalAND -- because it knows the value can only be either 0 or 0x8000. (pushed a fix for that)

Another option is to disable inlining, though this would still be an extra makefile option. If we specify __attribute__((noinline)) on DrawSpan and PlotPixel the result on build speed should be similar - maybe better than - removing template parameters. It's still not something we'd want enabled normally tho.

Also .. is gcc using LTO optimization? It doesn't seem to be specified, but maybe GCC defaults to LTO now with -O3? 60+ seconds to link a 6MB executable is not something I'd expect unless LTO were enabled. If it is enabled, turning it off would be another way to get very rapid build iteration times in development, with low perf impact. Also something we only want to turn off for developer builds and enable when building Release/Distribution libs.

… is enabled

lgtm-com · 2021-05-27T15:46:02Z

This pull request introduces 2 alerts when merging 6ee97d8 into 9589ddd - view on LGTM.com

new alerts:

2 for Short-circuiting operator applied to flag

ZachCook · 2021-05-28T00:45:31Z

There is also the rsx_intf_push_* hw renderer stuff that uses MaskEval_TA too.

Does removing pgxp from the command_drawpolygon template slow down runtime perf of the user build too like MaskEval_TA does? That is a large source of compile time and binary size by itself.

inactive123 · 2021-07-22T02:07:17Z

Hi there @jstine35 , if you can fix the merge conflict I'd happily merge this.

ZachCook · 2021-08-07T22:53:21Z

I think now that I know the impact of removing the templates, that we don't really need this fast compile option, it will only make things harder to understand, maintain, and debug, without benefiting users in any way, and will see very little usage overall. Just reverting d8479c7, to get the performance back is all that is really required.

jstine35 added 2 commits May 23, 2021 22:56

Revert "Reduce compile time up to 57%, reduce binary size up to 32%"

3d71c96

This reverts commit d8479c7.

gpu: simplify quick-build option and fix array bounds bug

4e0c81f

gpu: slowpath eval of gpu->MaskEvalAND when rapid-compilation feature…

6ee97d8

… is enabled

ZachCook mentioned this pull request Aug 24, 2021

Dynarec updates, and revert my bad gpu optimization #812

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gpu: add make option GPU_SW_COMPILE_FAST=1 to perform fast compile of Software GPU #794

gpu: add make option GPU_SW_COMPILE_FAST=1 to perform fast compile of Software GPU #794

jstine35 commented May 24, 2021

ZachCook commented May 24, 2021 •

edited

jstine35 commented May 25, 2021 •

edited

jstine35 commented May 25, 2021

ZachCook commented May 25, 2021

jstine35 commented May 27, 2021 •

edited

lgtm-com bot commented May 27, 2021

ZachCook commented May 28, 2021

inactive123 commented Jul 22, 2021

ZachCook commented Aug 7, 2021

gpu: add make option GPU_SW_COMPILE_FAST=1 to perform fast compile of Software GPU #794

Are you sure you want to change the base?

gpu: add make option GPU_SW_COMPILE_FAST=1 to perform fast compile of Software GPU #794

Conversation

jstine35 commented May 24, 2021

naming convention

ZachCook commented May 24, 2021 • edited

jstine35 commented May 25, 2021 • edited

jstine35 commented May 25, 2021

ZachCook commented May 25, 2021

jstine35 commented May 27, 2021 • edited

lgtm-com bot commented May 27, 2021

ZachCook commented May 28, 2021

inactive123 commented Jul 22, 2021

ZachCook commented Aug 7, 2021

ZachCook commented May 24, 2021 •

edited

jstine35 commented May 25, 2021 •

edited

jstine35 commented May 27, 2021 •

edited