Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gpu: add make option GPU_SW_COMPILE_FAST=1 to perform fast compile of Software GPU #794

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

jstine35
Copy link

This change reverts d8479c7 and replaces it with a make-conditional that skips the offending templates in developer build environments, while still allowing them to be effective for user builds. Some SW renderer performance is lost when this setting is enabled, but build iteration time is much quicker and executable size is reduced. Good option for devs doing rapid iterations, or if a targeted platform needs a smaller executable.

Example:

make -j6 GPU_SW_COMPILE_FAST=1

naming convention

GPU_SW_COMPILE_FAST is tentative: I just picked a name out of a hat, and not sure I like it (maybe a bit long). I'm open to suggestions for a better naming scheme for this makefile argument.

… Software GPU

Some SW renderer performance is lost when this setting is enabled, but build iteration time is much quicker and executable size is reduced. Good option for devs doing rapid iterations, or if a targeted platform needs a smaller executable.

Example:
  make -j6 GPU_SW_COMPILE_FAST=1
@ZachCook
Copy link
Collaborator

ZachCook commented May 24, 2021

When enabled this forces MaskEvalTA true, needs a runtime check of g->GPU.MaskEvalAND for each function that has a MaskEvalTA template like you did with pgxp_runtime.

When GPU.MaskEvalAND is true command->func[][] at the bottom of ProcessFIFO() will go out of bounds, need to remove that bitwise OR when enabled.

Though should really make sure that it is better for performance to keep these template bools instead of only the runtime checks, I would think due to how much the binary size increases that the additional function template bools would lower the cache hit rate, I couldn't detect any performance difference myself.

@jstine35
Copy link
Author

jstine35 commented May 25, 2021

Performance for this code is important for specific areas of specific games (usually menu screens, sadly). The specific case I was looking at recently while doing 2X uprender feature for swanstation was Gekido. It sets MaskEvalAND=0 and then renders several large quads across the whole screen. This caused the rasterizer to sample (read-back) from the framebuffer one very pixel unnecessarily and performance was quite poor as a result. Adding the MaskEval_TA sped things up by around 20% for this specific case.

What we are doing is the equivalent of building out a matrix of GPU shaders (pre-compiled matrix of GPU shaders is the main reason your typical Nvidia driver download is +500mb). Only some shaders are used for any given scene in any given game. The problem is that various game engines will have different preferred ways of doing things, and will use a slightly different set of shaders for the bulk of their work. It becomes necessary to bake in many varieties of shaders to cover the vast library of games on a PSX.

Cachelines absolutely do not matter in this context. Only a few KBs of the 3MB of pre-built shaders baked into the EXE will be used on any given frame. In fact cachelines should almost always benefit from these templates, because the majority of these custom shaders are much smaller than an "ubershader" meant to cover all functionality. For example, making pgxp a runtime option, because if its intrinsic alt-path code bloat, is going to be harder on L caches than compiling multiple versions with and without pgxp. (edit: also this is still not a great example - the main perf hit will be branch prediction, as the majority of pgxp code will be skipped by branch and thus still not cause much L-cache issue. But the branch predictor will suffer because it's also quite limited on most cpu architectures)

To be fair, this is all academic discussion: I doubt cachelines would matter even if some game did use a few dozen shader variants. Each shader is run thousands of times at once for any meaningful draw area coverage. The L-cache warmup would only impact the first iteration. Even if every shader was an L cache miss on the first pixel, we'd only see 0.005% perf hit except if a game had some special particle fx and was rendering each pixel using a different shader (likely impossible on PS1 anyway).

MaskEvalTA does not need to be evaluated at runtime. If it is enabled then it will be gated by the application of gpu->MaskEvalAND. On theory we could test MaskEvalAND prior to sampling the current texel, and thus remove the MaskEval_TA argument entirely (from either template or runtime). This would result in something inbetween:

  • overall SW performance for all games would be very slightly slower (maybe 1-2%)
  • edge cases like Gekido menus would be 12-17% faster (vs. 15-20%)

At a glance, I feel like Mednafen is specifying MaskEval_TA unnecessarily for a lot of function that don't use the parameter. This could be negatively impacting build times as well. Maybe the better approach is to remedy that issue. (but I think also some of this might be specific to pgxp, which I'm less familiar with than standard PS1 GPU rendering).

The array out of bounds I'll fix in a min. That was a last second foopah by me. We don't even need to change the array size. I just started thinking I should try to align the change more with the original patch and then tossed it in without testing. >_<

@jstine35
Copy link
Author

Pushed fix for reading past the end of array.

Another thing occurred to me - the majority of perf concerns are for Triangles only. Sprites are rarely used and lines as well. Chances are we can reduce template complexity for sprites and lines and get some build speed improvement, without meaningful performance hit since triangles would remain more highly optimized. But we also have to be weary that, I think, some games use orthogonal lines to render sprites/quads in clever ways sometimes, which means lots of lines and large coverage area and, by proxy, performance issues.

(unfortunately I don't know which offhand, only recalling the lamenting of other folks who have worked on these PS1 GPU emulation over the years).

@ZachCook
Copy link
Collaborator

PlotPixel, PlotNativePixel, DrawSpan, DrawSprite look like they change behavior with MaskEval_TA forced to 1 with these changes.

I definitely agree with reverting my fast compile hack with those runtime performance regressions, it might even be better to keep the longer compile times and larger binary size if this optimization would add too much divergence between user and developer builds, and being disabled by default there is going to be low usage.

I'll take a look at unused template parameters to see if we can get some of the compile speed and size benefits without regressions.

@jstine35
Copy link
Author

jstine35 commented May 27, 2021

Ah, I see - the GPU is baking 0x8000 instead of reading gpu->MaskEvalAND -- because it knows the value can only be either 0 or 0x8000. (pushed a fix for that)

Another option is to disable inlining, though this would still be an extra makefile option. If we specify __attribute__((noinline)) on DrawSpan and PlotPixel the result on build speed should be similar - maybe better than - removing template parameters. It's still not something we'd want enabled normally tho.

Also .. is gcc using LTO optimization? It doesn't seem to be specified, but maybe GCC defaults to LTO now with -O3? 60+ seconds to link a 6MB executable is not something I'd expect unless LTO were enabled. If it is enabled, turning it off would be another way to get very rapid build iteration times in development, with low perf impact. Also something we only want to turn off for developer builds and enable when building Release/Distribution libs.

@lgtm-com
Copy link

lgtm-com bot commented May 27, 2021

This pull request introduces 2 alerts when merging 6ee97d8 into 9589ddd - view on LGTM.com

new alerts:

  • 2 for Short-circuiting operator applied to flag

@ZachCook
Copy link
Collaborator

There is also the rsx_intf_push_* hw renderer stuff that uses MaskEval_TA too.

Does removing pgxp from the command_drawpolygon template slow down runtime perf of the user build too like MaskEval_TA does? That is a large source of compile time and binary size by itself.

@inactive123
Copy link
Contributor

Hi there @jstine35 , if you can fix the merge conflict I'd happily merge this.

@ZachCook
Copy link
Collaborator

ZachCook commented Aug 7, 2021

I think now that I know the impact of removing the templates, that we don't really need this fast compile option, it will only make things harder to understand, maintain, and debug, without benefiting users in any way, and will see very little usage overall. Just reverting d8479c7, to get the performance back is all that is really required.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants