Improve performance for powers of two #2

calebzulawski · 2020-01-21T04:20:17Z

The mixed-radix FFT algorithm's performance isn't too bad (at the time of writing, powers of three are something like 60% slower than FFTW on my machine).

However, the mixed-radix algorithm (at least as it exists right now) isn't the best choice for powers of two (approximately 3x slower than FFTW on my machine). I have to imagine the discrepancy here is that FFTW is probably using a modified split-radix algorithm. I'd probably be happy with a standard split-radix implementation for simplicity.

ejmahler · 2020-01-22T04:37:40Z

One thing worth looking into with all of this: When comparing benchmarks to rustFFT, is the difference in performance between power2 sizes down to algirthms, or is it down to the SIMD?

If it's just SIMD, I don't think you'll get much from split radix. I implemented it for rustFFT and ended up going back to radix4 because it was actually slower, not faster. I didn't implement the modified one, so maybe it's faster, but the impression i get is that it shaves off fractions of percents of performanc,e not 60%.

It's also possible that fourier's code structure is just better for code generation than rustfft, or maybe split radix is particularly well-suited to SIMD. So it's worth investigating.

calebzulawski · 2020-01-22T13:00:22Z

I do believe that the performance difference between Fourier and RustFFT comes down to SIMD (for powers of two, at least). It might be worth looking into the arithmetic operation count between mixed-radix and split-radix, just to get an idea of what to expect in the optimal case, you might be right that it's only a small improvement. If split-radix isn't the answer, I have to wonder what incantations FFTW, Intel, etc use for powers of two for such a big difference in performance.

ejmahler · 2020-03-09T09:26:24Z

I've done some exploration into split radix with AVX, and I've got promising results. According to the benchmarks, my hardcoded sizes are actually faster than FFTW, Larger split radix sizes are slower than FFTW, but faster than fourier.

Here's a link to the implementation

The main event is the SplitRadixAvx struct, and MixedRadixAvx4x2, 4x4, etc structs.

To assemble a power of two algorithm for benchmarks, i've been using this function:

fn get_splitradix_avx(len: usize) -> Arc<FFT<f32>> {
    match len {
        8 => Arc::new(MixedRadixAvx4x2::new(false)),
        16 => Arc::new(MixedRadixAvx4x4::new(false)),
        32 => Arc::new(MixedRadixAvx4x8::new(false)),
        64 => Arc::new(MixedRadixAvx8x8::new(false)),
        _ => {
             let mut radishes = Vec::new();
            radishes.push(Arc::new(MixedRadixAvx4x8::new(false)) as Arc<FFT<f32>>);
            radishes.push(Arc::new(MixedRadixAvx8x8::new(false)) as Arc<FFT<f32>>);

            while radishes.last().unwrap().len() < len {
                let quarter = Arc::clone(&radishes[radishes.len() - 2]);
                let half = Arc::clone(&radishes[radishes.len() - 1]);
                radishes.push(Arc::new(SplitRadixAvx::new(half, quarter)) as Arc<FFT<f32>>);
            }
            Arc::clone(&radishes.last().unwrap())
        }
    }
}

The hardcoded sizes are all pretty darn fast. I imagine there are improvements to be had, but they're gonna be small. I think there are still a lot of major improvements to be made to the split radix impl.

I also haven't put any work into f64 inputs at all, I imagine that's going to take an entirely different set of techniques.

I'd love to incorporate this into rustfft, but I need specialization to land before I can release a stable version, and there's a pretty brutal specialization bug that is preventing me from even incorporating it into the planner. But your library doesn't depend on specialization, so I bet you could use this stuff right away. Let me know what you think!

ejmahler · 2020-03-09T09:34:36Z

I implemented a few different hardcoded sizes using both mixed radix and split radix. For example I wrote both a split radix size 16, and a mixed radix 4x4. And I wrote a split radix size 32 implementation and a mixed radix 4x8 implementation. In both cases, the mixed radix was both dramatically simpler and dramatically faster.

Split radix definitely reduces multiplication counts and requires fewer twiddle factors, which is a big deal at large sizes, but for small sizes, all the reshuffling of data appears to just get in the way.

calebzulawski · 2020-03-20T12:57:33Z

Whoops, writing this on mobile and hit the wrong button :( This is very interesting and a great improvement, hopefully this weekend I'll get a chance to look into it. I wonder if there is a reformulation that is similar to Stockham autosort that eliminates the shuffling?

calebzulawski closed this as completed Mar 20, 2020

calebzulawski reopened this Mar 20, 2020

calebzulawski added enhancement New feature or request help wanted Extra attention is needed labels Aug 1, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve performance for powers of two #2

Improve performance for powers of two #2

calebzulawski commented Jan 21, 2020 •

edited

ejmahler commented Jan 22, 2020

calebzulawski commented Jan 22, 2020

ejmahler commented Mar 9, 2020 •

edited

ejmahler commented Mar 9, 2020

calebzulawski commented Mar 20, 2020 •

edited

Improve performance for powers of two #2

Improve performance for powers of two #2

Comments

calebzulawski commented Jan 21, 2020 • edited

ejmahler commented Jan 22, 2020

calebzulawski commented Jan 22, 2020

ejmahler commented Mar 9, 2020 • edited

ejmahler commented Mar 9, 2020

calebzulawski commented Mar 20, 2020 • edited

calebzulawski commented Jan 21, 2020 •

edited

ejmahler commented Mar 9, 2020 •

edited

calebzulawski commented Mar 20, 2020 •

edited