Optimize the 128-bit vectorization of Poly1305 MAC for PowerPC #453

Open
mamonet opened this issue Jun 14, 2021 · 7 comments

Comments

@mamonet
Member

mamonet commented Jun 14, 2021

This issue offers two tips that significantly boost the performance of vectorized code on the PowerPC architecture, with a smaller boost on other architectures such as x86_64.
I used the Poly1305 MAC to observe the performance loss and locate the bottleneck at the assembly level, keeping in mind that POWER8/POWER9 processors can pipeline up to 6 vector instructions simultaneously and achieve a reciprocal throughput of 0.5 for the vector instructions used. I used the son_ibm_ppc branch for testing.

Tip 1:
Writing to memory buffers inside the loop that iterates over message blocks in authentication algorithms like Poly1305 adds latency that can slow down the entire function. Avoiding unnecessary writes eliminates that overhead and increases performance.
I'll demonstrate the changes on Hacl_Poly1305_128.c, which can be generated from Hacl.Impl.Poly1305.fst using the KreMLin tool.
The function Hacl_Poly1305_128_poly1305_update() reads and writes the context buffer ctx at the acc offset inside the main loop, so the generated assembly code updates certain fields of the ctx buffer on every block iteration.
To eliminate that overhead, load the acc values into local variables before the loop begins, use those local variables in place of the corresponding ctx values inside the loop, and store them back to ctx after the loop ends.
Note: the compiler is aware that the values loaded from the pre buffer are the same for all blocks, so the generated code loads those values only once, before the loop starts.
With this tip applied, the performance of Hacl_Poly1305_128_poly1305_update() increased by 8% on a POWER9 processor and by 5.5% on x86_64 (tested on Ice Lake).
The change isn't tested or tuned for 256-bit vector registers, since PowerPC doesn't have any.
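
The pattern in Tip 1 can be sketched on a much simpler loop. The example below is not the HACL* code: toy_ctx, TOY_P and both update functions are hypothetical names, and the arithmetic is a scalar toy recurrence rather than Poly1305. It only illustrates the difference between updating the accumulator through the context pointer on every iteration and keeping it in a local variable for the duration of the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a MAC context; this is NOT the HACL* ctx layout. */
typedef struct {
    uint64_t acc; /* running accumulator, kept below TOY_P */
    uint64_t r;   /* fixed multiplier, kept below TOY_P */
} toy_ctx;

/* 2^31 - 1: small enough that (acc + byte) * r always fits in 64 bits. */
#define TOY_P 2147483647ULL

/* Baseline: ctx->acc is read and written on every iteration. */
void toy_update_in_ctx(toy_ctx *ctx, const uint8_t *msg, size_t len)
{
    assert(ctx != NULL);
    for (size_t i = 0; i < len; i++) {
        ctx->acc = ((ctx->acc + msg[i]) * ctx->r) % TOY_P;
    }
}

/* Hoisted: the accumulator lives in a local for the whole loop. */
void toy_update_hoisted(toy_ctx *ctx, const uint8_t *msg, size_t len)
{
    assert(ctx != NULL);
    uint64_t acc = ctx->acc;   /* load once, before the loop */
    const uint64_t r = ctx->r; /* loop-invariant multiplier */
    for (size_t i = 0; i < len; i++) {
        acc = ((acc + msg[i]) * r) % TOY_P;
    }
    ctx->acc = acc;            /* store once, after the loop */
}
```

Both functions produce the same accumulator for the same input. Whether a compiler performs this hoisting on its own depends on what it can prove about aliasing between ctx and msg, so writing it out explicitly removes that dependency.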

Tip 2:
Power ISA v3.00 introduces several instructions, such as an extended set of vector load/store operations and vector value extraction, that many functions in HACL* can take advantage of. They can be enabled by appending the -mpower9-vector flag in gcc-compatible/configure.
The overall performance increase of Hacl_Poly1305_128_poly1305_update() from using Power ISA v3.00 is 47% on a POWER9 processor.
Note: use of the -mpower9-vector flag should be optional so that the vectorized code keeps supporting older CPUs.
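
As a rough illustration of keeping the flag optional, C code can check at compile time whether the POWER9 vector extensions were enabled. The sketch below relies on GCC's predefined macro __POWER9_VECTOR__, which GCC sets when those extensions are on; TOY_HAVE_POWER9_VECTOR and the two functions are hypothetical names and are not part of HACL*.

```c
/* GCC predefines __POWER9_VECTOR__ when the POWER9 vector
 * extensions are enabled, e.g. through -mpower9-vector. */
#if defined(__POWER9_VECTOR__)
#define TOY_HAVE_POWER9_VECTOR 1
#else
#define TOY_HAVE_POWER9_VECTOR 0
#endif

/* Returns 1 if this translation unit was built with POWER9 vector support. */
int toy_power9_vector_enabled(void)
{
    return TOY_HAVE_POWER9_VECTOR;
}

/* Names the code path a build would select (hypothetical labels). */
const char *toy_vector_path_name(void)
{
    return toy_power9_vector_enabled() ? "power9-vector" : "baseline";
}
```

A build without the flag still compiles and reports the baseline path, which is the behaviour needed to keep supporting older CPUs.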

Benchmark numbers of Hacl_Poly1305_128_poly1305_update() on POWER9 measured by cycles per byte (cpb)

| Original | Patch (Tips applied) |
| --- | --- |
| 3.90 cpb | 2.47 cpb |
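
For readers comparing the two cycles-per-byte figures, the relative change can be worked out in two ways. The helpers below are hypothetical and only show the arithmetic: with 3.90 cpb before and 2.47 cpb after, the patch uses about 36.7% fewer cycles per byte, which corresponds to roughly 1.58 times the throughput.

```c
/* Fraction of cycles per byte saved: (before - after) / before.
 * Assumes before > 0. */
double cpb_reduction(double before, double after)
{
    return (before - after) / before;
}

/* Throughput ratio (bytes per cycle after vs. before) = before / after.
 * Assumes after > 0. */
double throughput_ratio(double before, double after)
{
    return before / after;
}
```

Calling cpb_reduction(3.90, 2.47) gives about 0.367, and throughput_ratio(3.90, 2.47) gives about 1.579.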

In my opinion, this is the best performance we can get from using C intrinsics to optimize Poly1305. To get optimal performance, PowerPC support needs to be added to Vale so that functions like the Poly1305 MAC can be implemented at the assembly level with verification. That way we have more control over the execution flow and can achieve higher performance by saturating the execution units. I'll open another issue to describe a potential roadmap for adding PowerPC support to Vale.

@msprotz
Contributor

msprotz commented Jun 15, 2021

Thanks Maamoun, that's really neat! To follow up on this:

  • can you attach a patch for the generated C code so that we can see precisely what needs to happen?
  • how to detect POWER9 in configure? should I call uname or something?

Thanks!

@mamonet
Member Author

mamonet commented Jun 15, 2021

> Thanks Maamoun, that's really neat! To follow up on this:
>
>   • can you attach a patch for the generated C code so that we can see precisely what needs to happen?

It can be done as shown in this variant: poly1305.txt (diff file)

>   • how to detect POWER9 in configure? should I call uname or something?

I would recommend adding the -mpower9-vector flag based on user input at build configuration time, since the build machine's architecture and the target architecture may differ.

@msprotz
Contributor

msprotz commented Jun 15, 2021

Thanks @mamonet that looks great. Just to clarify, do you plan on submitting PRs that implement these two changes or are you just filing bugs in the hope that someone picks them up?

Cheers,

Jonathan

@mamonet
Member Author

mamonet commented Jun 15, 2021

> Thanks @mamonet that looks great. Just to clarify, do you plan on submitting PRs that implement these two changes or are you just filing bugs in the hope that someone picks them up?

Submitting a PR for the first patch requires familiarity with many aspects of HACL*, such as function assurance and pre/post-condition assertions. I'm at a point where I can get a working patch for the Low* template, but I still have some difficulty keeping up with all the aspects, which take some time to grasp. I don't want to end up with a patch that misses an existing feature or isn't fully compatible with the other contexts. As soon as I'm ready to make full-fledged patches to the Low* files, I'll submit a PR for that patch if nobody has picked it up.

@msprotz
Contributor

msprotz commented Jun 15, 2021

Thanks. @polubelova wrote this code, so she might be able to tell us whether it's an easy fix or a more complex one.

I can take care of adding something to the configure script

@mamonet
Member Author

mamonet commented Jun 15, 2021

Thank you. Any assistance or hints would be great. I can upload my first attempt at modifying Hacl.Impl.Poly1305.fst to reflect the change, if that helps the code maintainer.

@mamonet
Member Author

mamonet commented Jun 16, 2021

It seems tuning flags have been added to the configure script without my noticing. However, I've submitted a proper configure patch in PR #455 that gives a slight performance boost on POWER9.
