Add vectorized fillNextPrimes() algorithm for other CPU architectures (e.g. arm64) #114

Open
kimwalisch opened this issue May 9, 2022 · 2 comments

@kimwalisch (Owner)

primesieve::iterator's performance depends heavily on the fillNextPrimes() method from PrimeGenerator.cpp. For x64 we have a vectorized AVX512 algorithm that is close to optimal for this task. Once other CPU architectures (e.g. arm64) support 512-bit vector instructions like AVX512, we should port our AVX512 algorithm to those architectures.

ARM recently (2021) added the Scalable Vector Extension (SVE) to its CPUs. SVE is meant to be a portable vector instruction set that works across different vector widths. However, for vectorizing our fillNextPrimes() method we need at least 512-bit vector instructions.
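
Since the SVE vector length is fixed per CPU implementation rather than per architecture, a port would have to query it at run time. Below is a minimal sketch of such a check, assuming the ACLE intrinsics from <arm_sve.h> (the function name hasSve512() is made up for illustration):

```cpp
#include <arm_sve.h>

// Illustrative sketch: an SVE port of the 512-bit algorithm could check at
// run time whether the CPU's SVE vectors are wide enough and otherwise fall
// back to the portable (non-vectorized) fillNextPrimes() code path.
inline bool hasSve512()
{
  // svcntb() returns the number of bytes per SVE vector: 64 bytes == 512 bits.
  return svcntb() >= 64;
}
```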

@kimwalisch (Owner, Author)

primesieve's AVX512 fillNextPrimes() algorithm is close to optimal for converting the 1-bits of the sieve array into the corresponding bit indexes/positions (and then into prime numbers). The algorithm mainly relies on the VPCOMPRESSB and VPERMB instructions from AVX512.
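
Here is a minimal sketch of the compression step, assuming AVX512-VBMI2 (this is not primesieve's actual code; the function bitPositions() and its signature are made up for illustration):

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Illustrative sketch: use VPCOMPRESSB (_mm512_maskz_compress_epi8,
// AVX512-VBMI2) to turn the 1-bits of a 64-bit sieve word into a packed
// list of their bit positions. 'out' must have room for 64 bytes.
// Converting the positions into primes is then a matter of scaling them
// and adding the sieve word's base offset.
inline std::size_t bitPositions(std::uint64_t bits, std::uint8_t* out)
{
  alignas(64) static const std::uint8_t indexes[64] =
  {
     0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
    16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
    32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
    48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63
  };

  __m512i idx = _mm512_load_si512((const void*) indexes);
  // Keep only the bytes whose corresponding bit in 'bits' is set,
  // packed to the front of the vector.
  __m512i packed = _mm512_maskz_compress_epi8((__mmask64) bits, idx);
  _mm512_storeu_si512((void*) out, packed);
  return (std::size_t) _mm_popcnt_u64(bits);
}
```

The full algorithm additionally uses VPERMB to rearrange and widen the packed byte positions into 64-bit primes; the sketch above only covers the VPCOMPRESSB part.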

According to ChatGPT, the new ARM SVE instruction set has equivalents of those instructions as well:

  • Yes, ARM Scalable Vector Extension (SVE) does offer a similar functionality to Intel AVX-512's VPCOMPRESSB instruction. In ARM SVE, the svcompress_b instruction serves a similar purpose.
  • Yes, ARM Scalable Vector Extension (SVE) provides a similar functionality to Intel's AVX-512 VPERMB instruction. In SVE, you can achieve similar byte-wise permute operations using the sveperm instruction family.

This means it should be relatively easy to port primesieve's AVX512 fillNextPrimes() algorithm to ARM SVE. But there is one big drawback: in ARM SVE the vector length is dynamic, which adds another layer of complexity when implementing this algorithm.
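
For reference, the ACLE intrinsic names I'm aware of differ from the ones quoted above: the SVE COMPACT instruction is exposed as svcompact_u32/svcompact_u64 (it packs the active elements of a vector to the front, but has no byte variant comparable to VPCOMPRESSB), and byte-wise permutes are done with TBL (svtbl). Here is a minimal sketch of COMPACT together with the run-time vector length handling, unrelated to primesieve's actual code (compactNonZero() is made up for illustration; it packs the non-zero 32-bit elements of an array):

```cpp
#include <arm_sve.h>
#include <cstdint>
#include <cstddef>

// Illustrative sketch: pack all non-zero 32-bit values of 'in' into 'out'
// using the SVE COMPACT instruction. Note that the number of lanes per
// vector (svcntw()) is only known at run time.
std::size_t compactNonZero(const std::uint32_t* in,
                           std::uint32_t* out,
                           std::size_t n)
{
  std::size_t written = 0;

  for (std::size_t i = 0; i < n; i += svcntw())
  {
    svbool_t pg = svwhilelt_b32_u64(i, n);
    svuint32_t vals = svld1_u32(pg, &in[i]);
    // Active lanes = non-zero elements
    svbool_t nonZero = svcmpne_n_u32(pg, vals, 0);
    // COMPACT: pack the active lanes to the front of the vector
    svuint32_t packed = svcompact_u32(nonZero, vals);
    std::uint64_t cnt = svcntp_b32(pg, nonZero);
    // Store only the packed elements
    svst1_u32(svwhilelt_b32_u64(0, cnt), &out[written], packed);
    written += (std::size_t) cnt;
  }

  return written;
}
```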

@kimwalisch (Owner, Author)

According to "ARM's Scalable Vector Extensions: A Critical Look at SVE2 For Integer Workloads", as of 2024 ARM CPUs with SVE or SVE2 support are still rare, and SVE2 implementations will likely be limited to 128-bit vectors for the next few years. On top of that, the dynamic vector length of ARM SVE is a real pain in many cases (including its use in primesieve). Maybe these issues explain why ARM SVE has not seen wide adoption yet...

It is best to wait a few years until ARM CPUs ship a vector extension of at least 256 bits (ideally 512 bits, as in AVX512). Ideally we also want a fixed vector length instruction set, similar to ARM NEON or AVX512.
