Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

spirv-opt fma folding causes large performance degradation on Adreno #5658

Closed
castano opened this issue May 2, 2024 · 6 comments · Fixed by #5682
Closed

spirv-opt fma folding causes large performance degradation on Adreno #5658

castano opened this issue May 2, 2024 · 6 comments · Fixed by #5682

Comments

@castano
Copy link

castano commented May 2, 2024

I recently discovered the reason why spirv-opt was causing a major performance degradation on some of my shaders. I narrowed it down to the simplify-instructions stage and digging deeper I found the culprit were the folding rules that transform sequence of multiply adds and subs into FMAs.

The issue was introduced in these two commits:

And the simplest solution is to comment out these two lines of code:

//rules_[spv::Op::OpFAdd].push_back(MergeMulAddArithmetic);
//rules_[spv::Op::OpFSub].push_back(MergeMulSubArithmetic);

The performance drop appears to be caused by poor handling of explicit FMA operations in the Adreno compiler. In particular, it appears that it's only implemented for fp32, so when the arguments are in fp16 they are converted to fp32 and the result is converted back to fp16. As a result, what was intended to be an optimization, actually results in a significant performance drop.

For the Spark codecs that I've been developing, the proposed fix usually results in a 20% to 30% increased throughput.

Here's the example output of the Adreno Offline compiler for one of my codecs:

With FMA fix:
Total instruction count                                     :   868
ALU instruction count - 32 bit                              :   52
ALU instruction count - 16 bit                              :   466

Without FMA fix:
Total instruction count                                     :   1077
ALU instruction count - 32 bit                              :   200
ALU instruction count - 16 bit                              :   335

Note not only the increase in instruction count, but also the ratio between 32 bit and 16 bit ALU instructions.

Note, this affects all the Adreno devices that I'm aware of. A driver fix is on the way, but many devices will never be upgraded to the latest driver, so I would highly encourage everyone to adopt this workaround.

I've also tested the proposed change on many other devices: This has no effect on any Arm device, the results are the same in both cases. It produces a small performance improvement on PowerVR devices, and a very slight regression on AMD/Samsung.


In general, I would argue that spirv-opt should not be issuing FMAs. Originally the spirv spec did not include an FMA operation, this was added later through the extended instructions for GLSL:

https://registry.khronos.org/SPIR-V/specs/unified1/GLSL.std.450.html

My expectation is that FMA should be used in the same spirit as it was defined in GLSL. The purpose of FMA in GLSL was never to be an optimization, but to give the programmer more control over how certain expressions were evaluated, which in some cases would result in slower code. Outside of a precise statement the FMA does not impose any constrain, it's merely a hint, but inside precise statements the spec provides certain guarantees to ensure consistent evaluation:

https://registry.khronos.org/OpenGL-Refpages/gl4/html/fma.xhtml

For more background, long time ago I wrote about the motivations behind the precise statements and the fma intrinsic:

Watertight Tessellation: precise and fma

SPIR-V doesn't have the concept of precise operations. It does expose the NoContraction decorator and we could overload it to mean that an FMA needs to be evaluated as if it was part of a precise statement, however this is not specified anywhere. In the absence of a formal specification I think it's natural for IHVs to respect FMAs in all scenarios, which could actually reduce optimization opportunities.

@cwabbott0
Copy link

I'll just correct a few things:

Originally the spirv spec did not include an FMA operation, this was added later

The extended instructions for GLSL were released together with the initial Vulkan and SPIR-V release. FMA has always been there, it was never added later.

SPIR-V doesn't have the concept of precise operations. It does expose the NoContraction decorator and we could overload it to mean that an FMA needs to be evaluated as if it was part of a precise statement, however this is not specified anywhere.

No, NoContract is already the same thing as precise in GLSL. This is specified in the Vulkan specification, appendix A ("Vulkan environment for SPIR-V") in the "Precision and Operation of SPIR-V Instructions" section. The reason it's not in the SPIR-V spec is that unlike GLSL, SPIR-V is used in different environments, like OpenCL, which have different precision requirements.

That being said, fusing mul+add is something that pretty much all vendor drivers do, so this optimization is unlikely to have much effect in practice. If one vendor has a bad (perf-wise) implementation of FMA then it shouldn't hurt to drop it.

@castano
Copy link
Author

castano commented May 6, 2024

Thanks for the correction. I had wrongly assumed the "Extended Instructions for GLSL" was an extension that came after the initial SPIR-V release.

However, I don't think the text in Appendix A that you point to is sufficiently clear. I'm also a bit confused, because reading the text of an earlier release of the "SPIR-V Extended Instructions for GLSL" document (revision 4):

https://registry.khronos.org/SPIR-V/specs/1.0/GLSL.std.450.pdf

The behavior of the Fma extended instruction and its interaction with the NoContract decorator is described very clearly.

Computes a * b + c. In uses where this operation is decorated with NoContraction:

  • fma is considered a single operation, whereas the expression a * b + c is considered two operations.
  • The precision of fma can differ from the precision of the expression a * b + c.
  • fma will be computed with the same precision as any other fma decorated with NoContraction, giving invariant
    results for the same input values of a, b, and c.
    Otherwise, in the absence of a NoContraction decoration, there are no special constraints on the number of
    operations or difference in precision between fma and the expression a * b +c.

However, in revision 12 that text has disappeared. The description is not as precise and the interactions with NoContract are not there anymore:

Computes a * b + c.

https://registry.khronos.org/SPIR-V/specs/unified1/GLSL.std.450.pdf

Looking at the change history I don't see any mention of this change so it appears that this is most likely an error.

@cwabbott0
Copy link

I think it was because it was replaced with this text in the Vulkan spec (https://registry.khronos.org/vulkan/specs/1.3-extensions/html/vkspec.html#spirvenv-evaluation-expressions):

Implementations may rearrange floating-point operations using any of the mathematical properties governing the expressions in precise arithmetic, even where the floating- point operations do not share these properties. This includes, but is not limited to, associativity and distributivity, and may involve a different number of rounding steps than would occur if the operations were not rearranged. In shaders that use the SignedZeroInfNanPreserve Execution Mode the values must be preserved if they are generated after any rearrangement but the Execution Mode does not change which rearrangements are valid. This rearrangement can be prevented for particular operations by using the NoContraction decoration.

How is this not sufficiently clear? IMO it captures the same thing, while also being clear that NoContract also disallows algebraic rearrangement and this also affects other opcodes like FMix. And as I said before, different environments have different precision requirements so putting this Vulkan-specific text in the SPIR-V spec isn't appropriate.

@castano
Copy link
Author

castano commented May 6, 2024

I don't see how the revision 12 is more clear than revision 4, despite of the text you quote.

Also, my goal is to a) Have this regression fixed. and b) try to understand how this regression occurred in the first place in order to correct that.

I think one of the reasons why it may have occurred is due to a misunderstanding of what the fma instruction does and how it should be interpreted by the implementations. This misunderstanding may not have occurred if the documentation was more clear.

There are two elements that contributed to this regression:

  • spirv-opt is treating SPIR-V Fma instructions as an optimization. However, the glsl language does not offer any guarantees about how fmas are executed outside precise statements.
  • The Adreno compiler is interpreting SPIR-V Fma instructions literally and translating them to native multiply-add instructions.

In both cases there are different parties that have made mistakes interpreting the specification, which reinforces my belief that the specification could be more clear.

@s-perron
Copy link
Collaborator

This was done for Swiftshader, but that is not as important. We will revert those changes to avoid the poor code generated by Adreno compilers.

As for why this change was done.

spirv-opt is treating SPIR-V Fma instructions as an optimization. However, the glsl language does not offer any guarantees about how fmas are executed outside precise statements.

spirv-opt viewed it a neutral if you used the Fma or not because, without the nocontract, ICD compilers are free to use a hardware fma or split it up into two instructions in either case. Our view is that a reasonable ICD compiler would at worst be neutral. I do not view this as wrong. Since spirv-opt was used in a special way in Swiftshader, this optimization is what allowed it to consistently generate FMAs on the CPUs. We had a use case where it was a good idea to generate the FMAs, so we went ahead with it.

We cannot test every change on all of the different platforms that exists, and we rely on the community to report any oddities. Thanks for bringing it to our attention. We probably would have noticed it sooner, but the version of spirv-opt in the android NDK has not been updated since this change went in.

s-perron added a commit to s-perron/SPIRV-Tools that referenced this issue May 22, 2024
This removes the folding rules added in KhronosGroup#4783 and KhronosGroup#4808. They lead to
poor code generation on Adreno devices when 16-bit floating point values
were used. Since this change is transformation is suppose to be neutral,
there is no general reason to continue doing it.

I have talked to the owners of SwiftShader, and they do not mind if the
transform is removed. They were the ones the requested the change in the
first place.

Fixes KhronosGroup#5658
@s-perron s-perron self-assigned this May 22, 2024
s-perron added a commit that referenced this issue May 22, 2024
This removes the folding rules added in #4783 and #4808. They lead to
poor code generation on Adreno devices when 16-bit floating point values
were used. Since this change is transformation is suppose to be neutral,
there is no general reason to continue doing it.

I have talked to the owners of SwiftShader, and they do not mind if the
transform is removed. They were the ones the requested the change in the
first place.

Fixes #5658
@castano
Copy link
Author

castano commented May 22, 2024

Hi Steven, thanks for addressing this and for your clarification. I agree that in an ideal world that transformation should have been neutral and not have caused any problem. It's unfortunate it took so long to detect this problem. If SwiftShader or other users benefit from it, perhaps it could be added back under an optional command line argument.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants