Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ideas to improve FFT in Linbox #181

Open
1 of 10 tasks
romainlebreton opened this issue Feb 7, 2019 · 0 comments
Open
1 of 10 tasks

Ideas to improve FFT in Linbox #181

romainlebreton opened this issue Feb 7, 2019 · 0 comments
Assignees

Comments

@romainlebreton
Copy link
Member

romainlebreton commented Feb 7, 2019

List of ideas to improve FFT in Linbox

  • Use Barrett modular multiplication for integral types in FFT
  • Implement FFT<Modular<double,double>> using [HLQ'16, Function 3.3, reduce numeric half 1] to compute (a*b mod p) or a variant of it that precomputes b/p (to be checked).

More technical improvements:

  • DIF and DIT could be done in bitreverse order because the phases where entries are multiplied by [1 xhi xhi^2 xhi^3], [1 xhi xhi^2 xhi^3], ... would be replaced by multiplication by [xhi xhi xhi xhi], [xhi^2 xhi^2 xhi^2 xhi^2], ... which might simplify computations.
  • Implement better caching strategies. Suppose that we have k steps in the FFT (n=2^k), there are 2 main possibilities (see [HLQ'16, Algorithm 1]) :
    • Regroup steps m by m where m is fixed so that 2^m elements match the Simd vector size (eg m=3 for _mm256 on uint32 since 2^3 = 8 uint32 fit in 256 bits). This comes down to decompose your FFT_2^k into recursive calls to FFT_2^m. In practice, a permutation has to be done every m steps.
    • Cut in sqrt(k) steps of size sqrt(k).
  • If AVX512 are enabled, the intrinsics _mm(256/512)_madd52(lo/hi)_epu64 could be used to implement efficiently FFT on uint64 with p<2^52. Currently, we break SIMD by emulating the function mulhi on 4 scalars in the file simd256_int64.inl

Ideas with potentially less impact on performances to try first on NoSimd variant:

  • Vectorize FFT initialization of pow_w and pow_wp
  • The first step of FFT does not need to reduce from 0<=<2p to 0<=<p since it is already the case
  • Loop unrolling (improve cache locality ?)
  • the last step of inverse FFT (u v) => ((u+v)*xhi, (u-v)*xhi) is followed by multiplying every element by 1/n. The last multiplication of the FFT and the _*1/n can be merged
  • the last step of inverse FFT finishes by a reduction from 0<=<4p to 0<=<2p, and a bit later from 0<=<2p to 0<=<p. These should be merged.

Reference

[HLQ'16] Modular SIMD arithmetic in MATHEMAGIX

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants