New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
spirv-opt fma folding causes large performance degradation on Adreno #5658
Comments
I'll just correct a few things:
The extended instructions for GLSL were released together with the initial Vulkan and SPIR-V release. FMA has always been there, it was never added later.
No, That being said, fusing mul+add is something that pretty much all vendor drivers do, so this optimization is unlikely to have much effect in practice. If one vendor has a bad (perf-wise) implementation of FMA then it shouldn't hurt to drop it. |
Thanks for the correction. I had wrongly assumed the "Extended Instructions for GLSL" was an extension that came after the initial SPIR-V release. However, I don't think the text in Appendix A that you point to is sufficiently clear. I'm also a bit confused, because reading the text of an earlier release of the "SPIR-V Extended Instructions for GLSL" document (revision 4): https://registry.khronos.org/SPIR-V/specs/1.0/GLSL.std.450.pdf The behavior of the Fma extended instruction and its interaction with the NoContract decorator is described very clearly.
However, in revision 12 that text has disappeared. The description is not as precise and the interactions with NoContract are not there anymore:
https://registry.khronos.org/SPIR-V/specs/unified1/GLSL.std.450.pdf Looking at the change history I don't see any mention of this change so it appears that this is most likely an error. |
I think it was because it was replaced with this text in the Vulkan spec (https://registry.khronos.org/vulkan/specs/1.3-extensions/html/vkspec.html#spirvenv-evaluation-expressions):
How is this not sufficiently clear? IMO it captures the same thing, while also being clear that |
I don't see how the revision 12 is more clear than revision 4, despite of the text you quote. Also, my goal is to a) Have this regression fixed. and b) try to understand how this regression occurred in the first place in order to correct that. I think one of the reasons why it may have occurred is due to a misunderstanding of what the fma instruction does and how it should be interpreted by the implementations. This misunderstanding may not have occurred if the documentation was more clear. There are two elements that contributed to this regression:
In both cases there are different parties that have made mistakes interpreting the specification, which reinforces my belief that the specification could be more clear. |
This was done for Swiftshader, but that is not as important. We will revert those changes to avoid the poor code generated by Adreno compilers. As for why this change was done.
spirv-opt viewed it a neutral if you used the Fma or not because, without the nocontract, ICD compilers are free to use a hardware fma or split it up into two instructions in either case. Our view is that a reasonable ICD compiler would at worst be neutral. I do not view this as wrong. Since spirv-opt was used in a special way in Swiftshader, this optimization is what allowed it to consistently generate FMAs on the CPUs. We had a use case where it was a good idea to generate the FMAs, so we went ahead with it. We cannot test every change on all of the different platforms that exists, and we rely on the community to report any oddities. Thanks for bringing it to our attention. We probably would have noticed it sooner, but the version of spirv-opt in the android NDK has not been updated since this change went in. |
This removes the folding rules added in KhronosGroup#4783 and KhronosGroup#4808. They lead to poor code generation on Adreno devices when 16-bit floating point values were used. Since this change is transformation is suppose to be neutral, there is no general reason to continue doing it. I have talked to the owners of SwiftShader, and they do not mind if the transform is removed. They were the ones the requested the change in the first place. Fixes KhronosGroup#5658
This removes the folding rules added in #4783 and #4808. They lead to poor code generation on Adreno devices when 16-bit floating point values were used. Since this change is transformation is suppose to be neutral, there is no general reason to continue doing it. I have talked to the owners of SwiftShader, and they do not mind if the transform is removed. They were the ones the requested the change in the first place. Fixes #5658
Hi Steven, thanks for addressing this and for your clarification. I agree that in an ideal world that transformation should have been neutral and not have caused any problem. It's unfortunate it took so long to detect this problem. If SwiftShader or other users benefit from it, perhaps it could be added back under an optional command line argument. |
I recently discovered the reason why spirv-opt was causing a major performance degradation on some of my shaders. I narrowed it down to the
simplify-instructions
stage and digging deeper I found the culprit were the folding rules that transform sequence of multiply adds and subs into FMAs.The issue was introduced in these two commits:
And the simplest solution is to comment out these two lines of code:
The performance drop appears to be caused by poor handling of explicit FMA operations in the Adreno compiler. In particular, it appears that it's only implemented for fp32, so when the arguments are in fp16 they are converted to fp32 and the result is converted back to fp16. As a result, what was intended to be an optimization, actually results in a significant performance drop.
For the Spark codecs that I've been developing, the proposed fix usually results in a 20% to 30% increased throughput.
Here's the example output of the Adreno Offline compiler for one of my codecs:
Note not only the increase in instruction count, but also the ratio between 32 bit and 16 bit ALU instructions.
Note, this affects all the Adreno devices that I'm aware of. A driver fix is on the way, but many devices will never be upgraded to the latest driver, so I would highly encourage everyone to adopt this workaround.
I've also tested the proposed change on many other devices: This has no effect on any Arm device, the results are the same in both cases. It produces a small performance improvement on PowerVR devices, and a very slight regression on AMD/Samsung.
In general, I would argue that spirv-opt should not be issuing FMAs. Originally the spirv spec did not include an FMA operation, this was added later through the extended instructions for GLSL:
https://registry.khronos.org/SPIR-V/specs/unified1/GLSL.std.450.html
My expectation is that FMA should be used in the same spirit as it was defined in GLSL. The purpose of FMA in GLSL was never to be an optimization, but to give the programmer more control over how certain expressions were evaluated, which in some cases would result in slower code. Outside of a precise statement the FMA does not impose any constrain, it's merely a hint, but inside precise statements the spec provides certain guarantees to ensure consistent evaluation:
https://registry.khronos.org/OpenGL-Refpages/gl4/html/fma.xhtml
For more background, long time ago I wrote about the motivations behind the precise statements and the fma intrinsic:
Watertight Tessellation: precise and fma
SPIR-V doesn't have the concept of precise operations. It does expose the NoContraction decorator and we could overload it to mean that an FMA needs to be evaluated as if it was part of a precise statement, however this is not specified anywhere. In the absence of a formal specification I think it's natural for IHVs to respect FMAs in all scenarios, which could actually reduce optimization opportunities.
The text was updated successfully, but these errors were encountered: