Optimize the 128-bit vectorization of Poly1305 MAC for PowerPC #453
Comments
Thanks Maamoun, that's really neat! To follow up on this:
Thanks!
It can be done as shown in this variant poly1305.txt (diff file)
I would recommend adding the -mpower9-vector flag based on user input at build configuration time, since the build machine's architecture and the target architecture may differ.
Thanks @mamonet, that looks great. Just to clarify, do you plan on submitting PRs that implement these two changes or are you just filing bugs in the hope that someone picks them up? Cheers, Jonathan
Submitting a PR for the first patch requires some familiarity with many aspects of HACL*, such as function assurance and pre/post-condition asserts. I'm at a point where I can get a working patch for the Low* template, but I still have some difficulty keeping up with all the aspects, which need some time to grasp. I don't want to end up with a patch that misses a persistent feature or is not fully compatible with the other contexts. As soon as I'm ready to make full-fledged patches to the Low* files, I'll submit a PR for that patch if nobody has picked it up.
Thanks. @polubelova wrote this code, so she might be able to tell us whether it's an easy fix or a more complex one. I can take care of adding something to the configure script.
Thank you. Any assistance or hints would be great. I can upload my first attempt at modifying Hacl.Impl.Poly1305.fst to reflect the change, if that helps the code maintainer.
It seems a tuning flag has been added to the configure script without me noticing. However, I've submitted a proper configure patch in PR #455 that gives a slight performance boost on POWER9.
This issue offers two tips that significantly boost the performance of vectorized code on the PowerPC architecture, in addition to a small boost for other architectures like x86_64.
I used the Poly1305 MAC to observe the performance loss and locate the bottleneck at the assembly level, keeping in mind that POWER8/POWER9 processors can pipeline up to 6 vector instructions simultaneously and reach a reciprocal throughput of 0.5 for the vector instructions used. I used the son_ibm_ppc branch for testing.
Tip 1:
Writing to memory buffers inside the loop that iterates over message blocks in authentication algorithms like Poly1305 adds latency that can become overhead for the entire function. Avoiding unnecessary writes could eliminate that overhead and increase performance.
I'll demonstrate the changes on Hacl_Poly1305_128.c, which can be generated from Hacl.Impl.Poly1305.fst using the KreMLin tool.
The function Hacl_Poly1305_128_poly1305_update() reads and writes the context buffer ctx at the acc offset inside the main loop, so the generated assembly code updates certain fields in the ctx buffer on every block iteration.
To eliminate that overhead, one needs to load the acc values into local variables before the loop begins, use those local variables in place of the ctx fields inside the loop, and store them back after the loop ends.
Note: the compiler is aware that the values loaded from the pre buffer are the same for all blocks, so the generated code loads them only once before the loop starts.
By applying this tip, the performance of Hacl_Poly1305_128_poly1305_update() increased by 8% on a POWER9 processor and by 5.5% on the x86_64 architecture (tested on Ice Lake).
The change isn't tested or tuned for 256-bit vector registers since PowerPC doesn't have any.
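The idea behind Tip 1 can be sketched in plain C. This is only an illustration under stated assumptions, not the HACL* code: the two-slot context layout, the toy prime, and every toy_* name are hypothetical, and a scalar multiply-and-reduce stands in for the vectorized Poly1305 arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy layout, not the real HACL* context:
   ctx[0] stands in for the accumulator (acc),
   ctx[1] stands in for the precomputed key value (pre). */
#define TOY_PRIME 2147483647ULL /* 2^31 - 1, so products stay below 2^64 */

/* Baseline: acc is read from and written back to ctx on every block. */
static void toy_update_in_ctx(uint64_t *ctx, const uint8_t *text, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        ctx[0] = (ctx[0] + text[i]) % TOY_PRIME * ctx[1] % TOY_PRIME;
    }
}

/* Tip 1: load acc (and the key value) into locals before the loop,
   work only on the locals inside it, and store acc once afterwards. */
static void toy_update_local_acc(uint64_t *ctx, const uint8_t *text, size_t len)
{
    uint64_t acc = ctx[0];
    const uint64_t r = ctx[1];
    for (size_t i = 0; i < len; i++) {
        acc = (acc + text[i]) % TOY_PRIME * r % TOY_PRIME;
    }
    ctx[0] = acc; /* single write-back after the last block */
}

/* Self-check: both variants must leave the same accumulator in ctx. */
static void toy_self_check(void)
{
    const uint8_t text[] = "example message, one byte per toy block";
    uint64_t a[2] = { 0, 123456789 };
    uint64_t b[2] = { 0, 123456789 };
    toy_update_in_ctx(a, text, sizeof text - 1);
    toy_update_local_acc(b, text, sizeof text - 1);
    assert(a[0] == b[0]);
    assert(b[1] == 123456789); /* the key value is never modified */
}
```

Both variants compute the same result; the second only changes where the running value lives during the loop. Whether a compiler already performs this hoisting on its own depends on what it can prove about aliasing, so any gain should be confirmed by inspecting the generated assembly and benchmarking.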
Tip 2:
Power ISA v3.00 introduced several instructions, such as an extended set of vector load/store operations and vector value extraction, that many functions in HACL* can take advantage of. They can be enabled by appending the -mpower9-vector flag in gcc-compatible/configure.
The overall performance increase of Hacl_Poly1305_128_poly1305_update() from using Power ISA v3.00 is 47% on a POWER9 processor.
Note: the use of the -mpower9-vector flag should be optional so that the vectorized code keeps supporting older CPUs.
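One way to keep the flag optional from the C side is to let the code report which vector level it was actually built for. The sketch below is an assumption-laden illustration, not part of HACL*: it relies on GCC and Clang predefining __POWER9_VECTOR__, __POWER8_VECTOR__, __VSX__ and __ALTIVEC__ when the matching -m flags (or an -mcpu= value that implies them) are in effect, and the TOY_VEC_BACKEND macro and toy_vector_backend() function are hypothetical names.

```c
#include <stdio.h>

/* Assumption: the compiler predefines these macros when the matching
   vector extension is enabled on the command line. */
#if defined(__POWER9_VECTOR__)
#define TOY_VEC_BACKEND "power9-vector"
#elif defined(__POWER8_VECTOR__)
#define TOY_VEC_BACKEND "power8-vector"
#elif defined(__VSX__) || defined(__ALTIVEC__)
#define TOY_VEC_BACKEND "vsx/altivec"
#else
#define TOY_VEC_BACKEND "generic"
#endif

/* Returns the vector level this translation unit was compiled for. */
static const char *toy_vector_backend(void)
{
    return TOY_VEC_BACKEND;
}

/* Prints the selected level, e.g. from a benchmark or test driver. */
static void toy_print_vector_backend(void)
{
    printf("vector backend: %s\n", toy_vector_backend());
}
```

Built without the flag, the same source falls back to an older level, which is the behaviour the note above asks for; a build that passes -mpower9-vector should only be deployed on CPUs that implement Power ISA v3.00.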
Benchmark numbers of Hacl_Poly1305_128_poly1305_update() on POWER9, measured in cycles per byte (cpb)
In my opinion, this is the best performance we can get from using C intrinsics to optimize Poly1305. To get optimal performance, one needs to add PowerPC support to Vale so that functions like the Poly1305 MAC can be implemented at the assembly level with verification. That way we can have more control over the execution flow and achieve higher performance by saturating the execution units. I'll open another issue to describe a potential roadmap for adding PowerPC support to Vale.