Ideas to improve FFT in Linbox #181

romainlebreton · 2019-02-07T17:51:50Z

List of ideas to improve FFT in Linbox

Use Barrett modular multiplication for integral types in FFT
Implement FFT<Modular<double,double>> using [HLQ'16, Function 3.3, reduce numeric half 1] to compute (a*b mod p) or a variant of it that precomputes b/p (to be checked).

DIF and DIT could be done in bitreverse order because the phases where entries are multiplied by [1 xhi xhi^2 xhi^3], [1 xhi xhi^2 xhi^3], ... would be replaced by multiplication by [xhi xhi xhi xhi], [xhi^2 xhi^2 xhi^2 xhi^2], ... which might simplify computations.
Implement better caching strategies. Suppose that we have k steps in the FFT (n=2^k), there are 2 main possibilities (see [HLQ'16, Algorithm 1]) :
- Regroup steps m by m where m is fixed so that 2^m elements match the Simd vector size (eg m=3 for _mm256 on uint32 since 2^3 = 8 uint32 fit in 256 bits). This comes down to decompose your FFT_2^k into recursive calls to FFT_2^m. In practice, a permutation has to be done every m steps.
- Cut in sqrt(k) steps of size sqrt(k).
If AVX512 are enabled, the intrinsics _mm(256/512)_madd52(lo/hi)_epu64 could be used to implement efficiently FFT on uint64 with p<2^52. Currently, we break SIMD by emulating the function mulhi on 4 scalars in the file simd256_int64.inl

Vectorize FFT initialization of pow_w and pow_wp
The first step of FFT does not need to reduce from 0<=<2p to 0<=<p since it is already the case
Loop unrolling (improve cache locality ?)
the last step of inverse FFT (u v) => ((u+v)*xhi, (u-v)*xhi) is followed by multiplying every element by 1/n. The last multiplication of the FFT and the _*1/n can be merged
the last step of inverse FFT finishes by a reduction from 0<=<4p to 0<=<2p, and a bit later from 0<=<2p to 0<=<p. These should be merged.

[HLQ'16] Modular SIMD arithmetic in MATHEMAGIX

romainlebreton assigned BastienVialla, romainlebreton, pascalgiorgi and cyrilbouvier Feb 7, 2019

romainlebreton added the enhancement label Feb 7, 2019