
ENH: Add pocketfft sources to numpy for testing, benchmarks, etc. #11888

Merged
merged 2 commits into from Dec 25, 2018

Conversation

@mreineck (Contributor) commented Sep 5, 2018

This pull request was opened at the request of @matthew-brett and is related to issues #11885, #11886 and #10625.

@ncgeib commented Sep 6, 2018

I ran the same scripts as I did for #10625 to compare accuracy and speed. The accuracy issue for real-valued FFTs is fixed, and in general pocketfft is faster than the current implementation in numpy, but still slower than scipy's fftpack (not taking the additional twiddle-factor creation into account). For prime sizes it is vastly faster because it uses an algorithm with lower asymptotic complexity.

All seems good to me. The only thing I can think of is that the pocketfft implementation could perhaps be sped up further.

[plots: accuracy_complex, accuracy_real]

EDIT: Removed speed benchmarks as they did not reflect the current status of the PR. See below.

@mreineck (Contributor, Author) commented Sep 6, 2018

Thanks a lot for these tests, @ncgeib!

Concerning performance, I don't fully understand yet why scipy's implementation is faster for regular sizes ... the algorithms are basically the same.

Hmm, one difference is that pocketfft does its own memory management, i.e. for every transform it calls malloc() and free() a handful of times. This could cause a slowdown, especially for short transforms. Maybe that's what we are seeing here.

@mreineck (Contributor, Author) commented Sep 6, 2018

Sorry, I'm being stupid ... of course I know why pocketfft is currently slower than scipy!

At the moment, pocketfft is not creating plans beforehand and caching them for re-use. That means that for every repetition of the transform it recomputes the twiddle factors all over again.

This can be fixed of course, but I'll probably need help writing a Python class that caches the pocketfft plans.
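The idea would be a small cache keyed by transform length, so the twiddle factors are computed once per size and reused. A minimal sketch in Python (names are hypothetical; `build_plan` stands in for whatever pocketfft plan constructor ends up being exposed):

```python
from collections import OrderedDict

class PlanCache:
    """Tiny LRU cache mapping transform length -> precomputed plan."""

    def __init__(self, maxsize=32):
        self._plans = OrderedDict()
        self._maxsize = maxsize

    def get(self, n, build_plan):
        try:
            plan = self._plans.pop(n)        # cache hit: move to front
        except KeyError:
            plan = build_plan(n)             # cache miss: pay the twiddle cost once
        self._plans[n] = plan
        if len(self._plans) > self._maxsize:
            self._plans.popitem(last=False)  # evict the least recently used plan
        return plan
```

Repeated transforms of the same length then hit the cache and skip the twiddle-factor recomputation entirely.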

@matthew-brett (Contributor) commented:

I wonder whether some small Cython classes would help here ...

@ncgeib commented Sep 6, 2018

@mreineck Yeah, sorry if that was not clear. The comparison is not really fair, as the timings exclude the twiddle-factor calculation and cache creation for numpy and scipy but include the twiddle-factor calculation for pocketfft.
Still, I am not sure this explains the whole runtime difference. numpy's C port of fftpack is also slower than the Fortran version. We would have to either include the twiddle computation in the measurements or use a cached interface to pocketfft.

@mreineck (Contributor, Author) commented Sep 6, 2018

The code now does some caching.

If someone could tell me the canonical way to pass a C pointer to Python, I'd be grateful. Currently I'm casting its value to a "long" and feel really dirty about it.

@charris (Member) commented Sep 6, 2018

canonical way to pass a C pointer to Python

NpyCapsule_*, see numpy/core/include/numpy/npy_3kcompat.h and grep for usage.

@rgommers (Member) commented Sep 6, 2018

I ran the same scripts as I did for #10625 to compare the accuracy and speed.

@ncgeib nice. It may be useful to contribute the accuracy benchmarks as tests and the speed ones as asv benchmarks to the various packages, if you have time. I doubt our current tests and benchmarks are as comprehensive.

@ncgeib commented Sep 6, 2018

I re-ran the speed tests. The comparison should now be fair (and more or less accurate). We see that pocketfft is still just a little bit slower than scipy's fftpack for most test cases. Maybe there is some overhead that could be reduced?

[plots: speed_complex, speed_real]

@rgommers If I find the time I will do that. The benchmarks basically copy the ones FFTW does. For the meantime, they can be found at https://github.com/ncgeib/benchmark-ffts

@rgommers (Member) commented Sep 6, 2018

@rgommers If I find the time I will do that. The benchmarks basically copy the ones FFTW does. For the meantime, they can be found at https://github.com/ncgeib/benchmark-ffts

Oh okay, if you mean an actual copy then never mind: FFTW is GPL-licensed, so we cannot use that code. A standalone repo is still useful though; we can then run it again in case of other PRs with major changes like this one.

@ncgeib commented Sep 6, 2018

@rgommers Sorry, I meant they copy the methodology the FFTW authors use. No code was copied in any way; I simply found their way of testing the accuracy of FFT implementations to be good (comparing against a high-precision calculation of the DFT). This allows comparing the graphs to the ones on their webpage.

I specifically wrote the benchmark scripts while hoping that at one point they could be used when numpy finally gains a better FFT implementation. In that sense I am quite happy now ;)
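That methodology (checking an FFT against a reference DFT evaluated at higher precision) can be sketched like this; `direct_dft` is a hypothetical helper, and `np.clongdouble` silently falls back to plain double precision on platforms without extended-precision floats:

```python
import numpy as np

def direct_dft(x):
    """O(n^2) reference DFT with twiddles evaluated in extended precision."""
    n = len(x)
    k = np.arange(n)
    # twiddle angles in long double, so the reference is more accurate than the FFT
    ang = np.longdouble(-2.0) * np.pi * np.outer(k, k) / n
    tw = np.cos(ang) + 1j * np.sin(ang)
    return tw @ x.astype(np.clongdouble)

def rel_error(x):
    """Relative L2 error of np.fft.fft against the reference DFT."""
    ref = direct_dft(x)
    got = np.fft.fft(x).astype(np.clongdouble)
    return float(np.linalg.norm(got - ref) / np.linalg.norm(ref))
```

Plotting `rel_error` over many sizes and random inputs gives accuracy graphs directly comparable to the FFTW-style ones above.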

@mreineck (Contributor, Author) commented Sep 6, 2018

@ncgeib I feel much more comfortable with the new benchmarks! The remaining performance difference between scipy and pocketfft is largest for small transform sizes and almost vanishes for longer ones, so my hope is that we can still gain a little bit by optimizing the wrapper code on the Python side.

@mreineck (Contributor, Author) commented Sep 6, 2018

Hmmm ... I don't see any locking/unlocking going on in the scipy code when the caches are updated. Could this be the reason for the performance difference? And why is scipy not subject to race conditions?

@rgommers (Member) commented Sep 6, 2018

Hmmm ... I don't see any locking/unlocking going on in the scipy code when the caches are updated. Could this be the reason for the performance difference? And why is scipy not subject to race conditions?

From scipy/scipy#1419: ... This means that SciPy's FFTs are not threadsafe, and thus the GIL cannot be released. Thus, SciPy has to lock up the interpreter while doing FFTs. ...

@mreineck (Contributor, Author) commented Sep 6, 2018

Maybe there's our performance overhead.

Also, scipy is doing cache management on the C side, which could also help.

But anyway I think this is a detail we can look into once we are convinced that everything else works as intended.
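The kind of locking under discussion, sketched in Python (names hypothetical; in the PR the cache and lock live on the C side): guarding cache updates with a lock keeps concurrent callers from building or freeing the same plan twice, which is what permits releasing the GIL around the transform itself, at the cost of one lock acquisition per call.

```python
import threading

_plans = {}
_plans_lock = threading.Lock()

def get_plan(n, build_plan):
    # Serialize cache updates so concurrent callers never build the same
    # plan twice; the transform itself can then run without the lock.
    with _plans_lock:
        plan = _plans.get(n)
        if plan is None:
            plan = _plans[n] = build_plan(n)
    return plan
```

Without the lock (scipy's situation), the cache is not threadsafe and the GIL must stay held for the whole transform.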

@mreineck (Contributor, Author) commented Sep 7, 2018

There seem to be unnecessary array copies in fftpack.py (and via copy/paste also in pocketfft.py):

The function fft starts with:

    a = asarray(a).astype(complex, copy=False)

(i.e. potentially there is no copy), while ifft starts with:

    # The copy may be required for multithreading.
    a = array(a, copy=True, dtype=complex)

This looks conspicuously inconsistent...
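The difference is easy to demonstrate: when the input is already a complex array, the fft-style conversion aliases it while the ifft-style conversion always copies.

```python
import numpy as np

a = np.zeros(4, dtype=complex)

b = np.asarray(a).astype(complex, copy=False)  # fft path: no copy for complex input
c = np.array(a, copy=True, dtype=complex)      # ifft path: unconditional copy

assert np.shares_memory(a, b)      # b aliases a's buffer
assert not np.shares_memory(a, c)  # c is an independent buffer
```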

@mreineck (Contributor, Author) commented Sep 7, 2018

BTW, I think that most of numpy.fft's slowness compared to scipy is explained by the many calls to _check_size() in helper.py. This is seriously suboptimal code.

@mreineck (Contributor, Author) commented Sep 7, 2018

After some more measurements I'm convinced that the remaining overhead is caused by cache lookup and management. If I use a primitive cache which stores just a single transform, I get speeds that lie practically on top of the scipy data points everywhere.
Not sure what to make of that...
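The single-entry cache amounts to something like this (a sketch; it trades hit rate across sizes for a single branch-and-compare instead of dict lookup and eviction bookkeeping):

```python
_last = (None, None)  # (length, plan) of the most recent transform

def get_plan(n, build_plan):
    global _last
    if _last[0] != n:                # miss: rebuild and overwrite the single slot
        _last = (n, build_plan(n))
    return _last[1]
```

This is fastest for the common benchmark pattern of repeating one transform size, which may explain why it lands on top of the scipy data points.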

@mreineck (Contributor, Author) commented Sep 7, 2018

This is what I get with the latest patch. Please keep in mind that this is not directly comparable to earlier performance plots, because they were measured on a different computer!

[plots: speed_complex, speed_real]

@ncgeib commented Sep 7, 2018

I can confirm these results on my machine. In the latest state of the PR the runtime of pocketfft is basically the same as scipy's fftpack for regular sizes.

@mreineck (Contributor, Author) commented Sep 7, 2018

Great!
So, what's the procedure from here on?

All Travis builds fail (even those for Python 3), because C89 is enforced. Do we need to request a change in policy here?

@charris (Member) commented Sep 7, 2018

We will try to move to C99 after Python 2.7 is dropped. Note that other parts of NumPy will need some minor fixes, or at least that was my impression last time I tried with the C99 flag.

@charris added this to the 1.17.0 release milestone Sep 7, 2018
@charris (Member) commented Sep 7, 2018

I'm thinking of branching 1.16.0 around the end of November.

@charris (Member) commented Sep 7, 2018

You can try changing the build/test environment in this PR to see what needs doing, but it would be best to keep those changes as separate commits as we might want to split the resulting PR when we move up. In particular, you can drop the 2.7 tests and look to making the rest of NumPy C99 safe. If some of that can be done without breaking 2.7 on windows, so much the better.

charris added a commit to charris/numpy that referenced this pull request Sep 7, 2018
[skip appveyor]
C99 looks to work with gcc for all the Python versions we currently
support. Testing with c99 is a step forward in testing numpy#11888 before
it gets merged after we drop Python 2.7 support.
@charris (Member) commented Sep 7, 2018

I've put up #11905 to start testing with c99 on travisCI.

@charris (Member) commented Dec 25, 2018

Looks like we need to modify lines 344-352 in runtests.py to remove -Werror=declaration-after-statement and add -std=c99. I'll make a PR.

@charris (Member) commented Dec 25, 2018

See #12610.

@mhvk (Contributor) commented Dec 25, 2018

This all looks great, but does need a release note!

Also, having not looked in any detail: the old fft could not do single precision; can pocketfft? Important for large but noisy data sets. (But obviously no reason to hold up the PR; if it were possible to implement easily, however, maybe raise an issue so it can be done later.)

@charris (Member) commented Dec 25, 2018

@mreineck I've made a separate PR for the travis-test.sh testing fix, after that goes in you will want to rebase, probably as part of adding a release note.

@tacaswell (Contributor) commented:

I think matplotlib did that way back when and had some problems, although I'm not really familiar with that package. @tacaswell Are you folks still using C++?

Yes, we still have a bunch of C++ (mostly because Agg is templated C++).

@charris (Member) commented Dec 25, 2018

@mreineck Might also want to squash the 84 commits :)

@mreineck (Contributor, Author) commented:

@mreineck I've made a separate PR for the travis-test.sh testing fix, after that goes in you will want to rebase, probably as part of adding a release note.

What would that look like? Some addition in doc/changelog/1.17.0-changelog.rst?

I'll fetch the latest changes from master and squash as soon as the tester runs through.

@mreineck (Contributor, Author) commented:

Also, having not looked in any detail: the old fft could not do single precision; can pocketfft?

Not at the moment. I think it could be added, but I don't really have the experience to judge where we have to keep double precision to avoid losing too much accuracy. If memory usage is the concern, people will be better off with an in-place implementation that computes twiddle factors on the fly. Performance-wise I don't expect any speedup potential unless explicit vectorization support is added.

@charris (Member) commented Dec 25, 2018

@mreineck Put the notes in doc/release/1.17.0-notes.rst under "Improvements". I think a one-line list entry in "Highlights" would also be appropriate, i.e., * Blah, blah....

@mhvk (Contributor) commented Dec 25, 2018

@mreineck - with single precision, all calculations would be single precision; in the old fftpack, there was

#define DOUBLE
#ifdef DOUBLE
#define Treal double
#else
#define Treal float
#endif

and then all code is defined in terms of Treal. I think if one were to add this, one would handle the code duplication using the machinery also used for ufuncs. But that is for another PR! (Perhaps as part of making the fft routines generalized ufuncs...)

@mreineck (Contributor, Author) commented:

Simply replacing double by float will have very undesirable consequences for twiddle factor computation. But I agree, that is for another time...

@mreineck (Contributor, Author) commented:

OK, I think I'm done.

@charris (Member) commented Dec 25, 2018

OK, let's get this in. I haven't reviewed the sources, but I think the FFT is not difficult to test, so will rely on the tests and user feedback leading up to the 1.17.0 release. Thanks @mreineck .

@charris (Member) commented Dec 25, 2018

@mreineck We should also add some benchmarks for this. There are currently some in scipy that can probably be modified to cover things like large(ish) prime sizes.

@charris merged commit bae3cfa into numpy:master Dec 25, 2018
@rgommers changed the title from "WIP: Add pocketfft sources to numpy for testing, benchmarks, etc." to "ENH: Add pocketfft sources to numpy for testing, benchmarks, etc." Dec 27, 2018
@mreineck (Contributor, Author) commented:

@charris Thanks a lot for merging!

I agree that correctness testing for FFTs is not hard to do. The previously existing unit tests have been kept, and I added identity tests (real and complex) for all lengths between 1 and 512. That should catch most potential problems.

Concerning benchmarks, I think @ncgeib's work (#11888 (comment)) would be the best start.
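The identity tests described above boil down to round-trip checks; a sketch of the idea with pytest-style assertions (the test name is illustrative, not the one used in the PR):

```python
import numpy as np

def test_fft_roundtrip_identity():
    rng = np.random.RandomState(1234)
    for n in range(1, 513):
        # complex round trip: ifft(fft(x)) should recover x
        x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
        assert np.allclose(np.fft.ifft(np.fft.fft(x)), x, atol=1e-10)
        # real round trip: irfft needs the original length for odd n
        xr = rng.standard_normal(n)
        assert np.allclose(np.fft.irfft(np.fft.rfft(xr), n), xr, atol=1e-10)
```

Sweeping every length from 1 to 512 exercises all the small radix kernels as well as the prime-size code path.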
