You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Use Barrett modular multiplication for integral types in FFT
Implement FFT<Modular<double,double>> using [HLQ'16, Function 3.3, reduce numeric half 1] to compute (a*b mod p) or a variant of it that precomputes b/p (to be checked).
More technical improvements:
DIF and DIT could be done in bitreverse order because the phases where entries are multiplied by [1 xhi xhi^2 xhi^3], [1 xhi xhi^2 xhi^3], ... would be replaced by multiplication by [xhi xhi xhi xhi], [xhi^2 xhi^2 xhi^2 xhi^2], ... which might simplify computations.
Implement better caching strategies. Suppose that we have k steps in the FFT (n=2^k), there are 2 main possibilities (see [HLQ'16, Algorithm 1]) :
Regroup steps m by m where m is fixed so that 2^m elements match the Simd vector size (eg m=3 for _mm256 on uint32 since 2^3 = 8 uint32 fit in 256 bits). This comes down to decompose your FFT_2^k into recursive calls to FFT_2^m. In practice, a permutation has to be done every m steps.
Cut in sqrt(k) steps of size sqrt(k).
If AVX512 are enabled, the intrinsics _mm(256/512)_madd52(lo/hi)_epu64 could be used to implement efficiently FFT on uint64 with p<2^52. Currently, we break SIMD by emulating the function mulhi on 4 scalars in the file simd256_int64.inl
Ideas with potentially less impact on performances to try first on NoSimd variant:
Vectorize FFT initialization of pow_w and pow_wp
The first step of FFT does not need to reduce from 0<=<2p to 0<=<p since it is already the case
Loop unrolling (improve cache locality ?)
the last step of inverse FFT (u v) => ((u+v)*xhi, (u-v)*xhi) is followed by multiplying every element by 1/n. The last multiplication of the FFT and the _*1/n can be merged
the last step of inverse FFT finishes by a reduction from 0<=<4p to 0<=<2p, and a bit later from 0<=<2p to 0<=<p. These should be merged.
Reference
[HLQ'16] Modular SIMD arithmetic in MATHEMAGIX
The text was updated successfully, but these errors were encountered:
List of ideas to improve FFT in Linbox
More technical improvements:
Ideas with potentially less impact on performances to try first on NoSimd variant:
Reference
[HLQ'16] Modular SIMD arithmetic in MATHEMAGIX
The text was updated successfully, but these errors were encountered: