Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance for powers of two #2

Open
calebzulawski opened this issue Jan 21, 2020 · 5 comments
Open

Improve performance for powers of two #2

calebzulawski opened this issue Jan 21, 2020 · 5 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@calebzulawski
Copy link
Owner

calebzulawski commented Jan 21, 2020

The mixed-radix FFT algorithm's performance isn't too bad (at the time of writing, powers of three are something like 60% slower than FFTW on my machine).

However, the mixed-radix algorithm (at least as it exists right now) isn't the best choice for powers of two (approximately 3x slower than FFTW on my machine). I have to imagine the discrepancy here is that FFTW is probably using a modified split-radix algorithm. I'd probably be happy with a standard split-radix implementation for simplicity.

@ejmahler
Copy link
Contributor

One thing worth looking into with all of this: When comparing benchmarks to rustFFT, is the difference in performance between power2 sizes down to algirthms, or is it down to the SIMD?

If it's just SIMD, I don't think you'll get much from split radix. I implemented it for rustFFT and ended up going back to radix4 because it was actually slower, not faster. I didn't implement the modified one, so maybe it's faster, but the impression i get is that it shaves off fractions of percents of performanc,e not 60%.

It's also possible that fourier's code structure is just better for code generation than rustfft, or maybe split radix is particularly well-suited to SIMD. So it's worth investigating.

@calebzulawski
Copy link
Owner Author

I do believe that the performance difference between Fourier and RustFFT comes down to SIMD (for powers of two, at least). It might be worth looking into the arithmetic operation count between mixed-radix and split-radix, just to get an idea of what to expect in the optimal case, you might be right that it's only a small improvement. If split-radix isn't the answer, I have to wonder what incantations FFTW, Intel, etc use for powers of two for such a big difference in performance.

@ejmahler
Copy link
Contributor

ejmahler commented Mar 9, 2020

I've done some exploration into split radix with AVX, and I've got promising results. According to the benchmarks, my hardcoded sizes are actually faster than FFTW, Larger split radix sizes are slower than FFTW, but faster than fourier.

Here's a link to the implementation

The main event is the SplitRadixAvx struct, and MixedRadixAvx4x2, 4x4, etc structs.

To assemble a power of two algorithm for benchmarks, i've been using this function:

fn get_splitradix_avx(len: usize) -> Arc<FFT<f32>> {
    match len {
        8 => Arc::new(MixedRadixAvx4x2::new(false)),
        16 => Arc::new(MixedRadixAvx4x4::new(false)),
        32 => Arc::new(MixedRadixAvx4x8::new(false)),
        64 => Arc::new(MixedRadixAvx8x8::new(false)),
        _ => {
             let mut radishes = Vec::new();
            radishes.push(Arc::new(MixedRadixAvx4x8::new(false)) as Arc<FFT<f32>>);
            radishes.push(Arc::new(MixedRadixAvx8x8::new(false)) as Arc<FFT<f32>>);

            while radishes.last().unwrap().len() < len {
                let quarter = Arc::clone(&radishes[radishes.len() - 2]);
                let half = Arc::clone(&radishes[radishes.len() - 1]);
                radishes.push(Arc::new(SplitRadixAvx::new(half, quarter)) as Arc<FFT<f32>>);
            }
            Arc::clone(&radishes.last().unwrap())
        }
    }
}

The hardcoded sizes are all pretty darn fast. I imagine there are improvements to be had, but they're gonna be small. I think there are still a lot of major improvements to be made to the split radix impl.

I also haven't put any work into f64 inputs at all, I imagine that's going to take an entirely different set of techniques.

I'd love to incorporate this into rustfft, but I need specialization to land before I can release a stable version, and there's a pretty brutal specialization bug that is preventing me from even incorporating it into the planner. But your library doesn't depend on specialization, so I bet you could use this stuff right away. Let me know what you think!

@ejmahler
Copy link
Contributor

ejmahler commented Mar 9, 2020

I implemented a few different hardcoded sizes using both mixed radix and split radix. For example I wrote both a split radix size 16, and a mixed radix 4x4. And I wrote a split radix size 32 implementation and a mixed radix 4x8 implementation. In both cases, the mixed radix was both dramatically simpler and dramatically faster.

Split radix definitely reduces multiplication counts and requires fewer twiddle factors, which is a big deal at large sizes, but for small sizes, all the reshuffling of data appears to just get in the way.

@calebzulawski
Copy link
Owner Author

calebzulawski commented Mar 20, 2020

Whoops, writing this on mobile and hit the wrong button :( This is very interesting and a great improvement, hopefully this weekend I'll get a chance to look into it. I wonder if there is a reformulation that is similar to Stockham autosort that eliminates the shuffling?

@calebzulawski calebzulawski reopened this Mar 20, 2020
@calebzulawski calebzulawski added enhancement New feature or request help wanted Extra attention is needed labels Aug 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

2 participants