
ENH: Add pocketfft sources to numpy for testing, benchmarks, etc. #11888

Merged
merged 2 commits into from Dec 25, 2018

Conversation

@mreineck (Contributor) commented Sep 5, 2018

This pull request was opened at the request of @matthew-brett and is related to issues #11885, #11886 and #10625.

@ncgeib commented Sep 6, 2018

I ran the same scripts as I did for #10625 to compare accuracy and speed. The accuracy issue for real-valued FFTs is fixed, and in general pocketfft is faster than the current implementation in numpy, but still slower than scipy's fftpack (not taking the additional twiddle-factor creation into account). For prime sizes it is vastly faster because it uses an algorithm with lower asymptotic complexity.

All seems good to me. The only thing I can think of is that the pocketfft implementation could perhaps be sped up further.

[plots: accuracy_complex, accuracy_real]

EDIT: Removed speed benchmarks as they did not reflect the current status of the PR. See below.

@mreineck (Contributor, Author) commented Sep 6, 2018

Thanks a lot for these tests, @ncgeib!

Concerning performance, I don't fully understand yet why scipy's implementation is faster for regular sizes ... the algorithms are basically the same.

Hmm, one difference is that pocketfft does its own memory management, i.e. for every transform it calls malloc() and free() a handful of times. This could cause a slowdown, especially for short transforms. Maybe that's what we are seeing here.

@mreineck (Contributor, Author) commented Sep 6, 2018

Sorry, I'm being stupid ... of course I know why pocketfft is currently slower than scipy!

At the moment, pocketfft is not creating plans beforehand and caching them for re-use. That means that for every repetition of the transform it recomputes the twiddle factors all over again.

This can be fixed of course, but I'll probably need help writing a Python class that caches the pocketfft plans.
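The idea would be a small cache keyed by transform length, so the twiddle factors are computed once per size and reused. A minimal sketch in Python (names are hypothetical; `build_plan` stands in for whatever pocketfft plan constructor ends up being exposed):

```python
from collections import OrderedDict

class PlanCache:
    """Tiny LRU cache mapping transform length -> precomputed plan."""

    def __init__(self, maxsize=32):
        self._plans = OrderedDict()
        self._maxsize = maxsize

    def get(self, n, build_plan):
        try:
            plan = self._plans.pop(n)        # cache hit: move to front
        except KeyError:
            plan = build_plan(n)             # cache miss: pay the twiddle cost once
        self._plans[n] = plan
        if len(self._plans) > self._maxsize:
            self._plans.popitem(last=False)  # evict the least recently used plan
        return plan
```

Repeated transforms of the same length then hit the cache and skip the twiddle-factor recomputation entirely.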

@matthew-brett (Contributor) commented:

I wonder whether some small Cython classes would help here ...

@ncgeib commented Sep 6, 2018

@mreineck Yeah, sorry if that was not clear. The comparison is not really fair, as the timings exclude the twiddle-factor calculation and cache creation for numpy and scipy but include the twiddle-factor calculation for pocketfft.
Still, I am not sure this explains the whole runtime difference. numpy's C port of fftpack is also slower than the Fortran version. We would have to either include the twiddle computation in the measurements or use a cached interface to pocketfft.

@mreineck (Contributor, Author) commented Sep 6, 2018

The code now does some caching.

If someone could tell me the canonical way to pass a C pointer to Python, I'd be grateful. Currently I'm casting its value to a "long" and feel really dirty about it.

@charris (Member) commented Sep 6, 2018

canonical way to pass a C pointer to Python

NpyCapsule_*, see numpy/core/include/numpy/npy_3kcompat.h and grep for usage.

@rgommers (Member) commented Sep 6, 2018

I ran the same scripts as I did for #10625 to compare the accuracy and speed.

@ncgeib nice. It may be useful to contribute the accuracy benchmarks as tests and the speed ones as asv benchmarks to the various packages, if you have time. I doubt our current tests and benchmarks are as comprehensive.

@ncgeib commented Sep 6, 2018

I re-ran the speed tests. The comparison should now be fair (and more or less accurate). We see that pocketfft is still just a little bit slower than scipy's fftpack for most test cases. Maybe there is some overhead that could be reduced?

[plots: speed_complex, speed_real]

@rgommers If I find the time I will do that. The benchmarks basically copy the ones FFTW does. For the meantime, they can be found at https://github.com/ncgeib/benchmark-ffts

@rgommers (Member) commented Sep 6, 2018

@rgommers If I find the time I will do that. The benchmarks basically copy the ones FFTW does. For the meantime, they can be found at https://github.com/ncgeib/benchmark-ffts

Oh okay, if you mean an actual copy then never mind: FFTW is GPL-licensed, so we cannot use that code. A standalone repo is still useful though; we can then run it again in case of other PRs with major changes like this one.

@ncgeib commented Sep 6, 2018

@rgommers Sorry, I meant they copy the methodology the FFTW authors use. No code was copied in any way; I simply found their way of testing the accuracy of FFT implementations to be good (comparing against a high-precision calculation of the DFT). This allows comparing the graphs to the ones on their webpage.

I specifically wrote the benchmark scripts while hoping that at one point they could be used when numpy finally gains a better FFT implementation. In that sense I am quite happy now ;)
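That methodology (checking an FFT against a reference DFT evaluated at higher precision) can be sketched like this; `direct_dft` is a hypothetical helper, and `np.clongdouble` silently falls back to plain double precision on platforms without extended-precision floats:

```python
import numpy as np

def direct_dft(x):
    """O(n^2) reference DFT with twiddles evaluated in extended precision."""
    n = len(x)
    k = np.arange(n)
    # twiddle angles in long double, so the reference is more accurate than the FFT
    ang = np.longdouble(-2.0) * np.pi * np.outer(k, k) / n
    tw = np.cos(ang) + 1j * np.sin(ang)
    return tw @ x.astype(np.clongdouble)

def rel_error(x):
    """Relative L2 error of np.fft.fft against the reference DFT."""
    ref = direct_dft(x)
    got = np.fft.fft(x).astype(np.clongdouble)
    return float(np.linalg.norm(got - ref) / np.linalg.norm(ref))
```

Plotting `rel_error` over many sizes and random inputs gives accuracy graphs directly comparable to the FFTW-style ones above.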

@mreineck (Contributor, Author) commented Sep 6, 2018

@ncgeib I feel much more comfortable with the new benchmarks! The remaining performance difference between scipy and pocketfft is largest for small transform sizes and almost vanishes for longer ones, so my hope is that we can still gain a little bit by optimizing the wrapper code on the Python side.

@mreineck (Contributor, Author) commented Sep 6, 2018

Hmmm ... I don't see any locking/unlocking going on in the scipy code when the caches are updated. Could this be the reason for the performance difference? And why is scipy not subject to race conditions?

@rgommers (Member) commented Sep 6, 2018

Hmmm ... I don't see any locking/unlocking going on in the scipy code when the caches are updated. Could this be the reason for the performance difference? And why is scipy not subject to race conditions?

From scipy/scipy#1419: ... This means that SciPy's FFTs are not threadsafe, and thus the GIL cannot be released. Thus, SciPy has to lock up the interpreter while doing FFTs. ...

@mreineck (Contributor, Author) commented Sep 6, 2018

Maybe there's our performance overhead.

Also, scipy is doing cache management on the C side, which could also help.

But anyway I think this is a detail we can look into once we are convinced that everything else works as intended.
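The kind of locking under discussion, sketched in Python (names hypothetical; in the PR the cache and lock live on the C side): guarding cache updates with a lock keeps concurrent callers from building or freeing the same plan twice, which is what permits releasing the GIL around the transform itself, at the cost of one lock acquisition per call.

```python
import threading

_plans = {}
_plans_lock = threading.Lock()

def get_plan(n, build_plan):
    # Serialize cache updates so concurrent callers never build the same
    # plan twice; the transform itself can then run without the lock.
    with _plans_lock:
        plan = _plans.get(n)
        if plan is None:
            plan = _plans[n] = build_plan(n)
    return plan
```

Without the lock (scipy's situation), the cache is not threadsafe and the GIL must stay held for the whole transform.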

@mreineck (Contributor, Author) commented Sep 7, 2018

There seem to be unnecessary array copies in fftpack.py (and via copy/paste also in pocketfft.py):

The function fft starts with:

    a = asarray(a).astype(complex, copy=False)

(i.e. potentially there is no copy), while ifft starts with:

    # The copy may be required for multithreading.
    a = array(a, copy=True, dtype=complex)

This looks conspicuously inconsistent...
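The difference is easy to demonstrate: when the input is already a complex array, the fft-style conversion aliases it while the ifft-style conversion always copies.

```python
import numpy as np

a = np.zeros(4, dtype=complex)

b = np.asarray(a).astype(complex, copy=False)  # fft path: no copy for complex input
c = np.array(a, copy=True, dtype=complex)      # ifft path: unconditional copy

assert np.shares_memory(a, b)      # b aliases a's buffer
assert not np.shares_memory(a, c)  # c is an independent buffer
```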

@mreineck (Contributor, Author) commented Sep 7, 2018

BTW, I think that most of numpy.fft's slowness compared to scipy is explained by the many calls to _check_size() in helper.py. This is seriously suboptimal code.

@mreineck (Contributor, Author) commented Sep 7, 2018

After some more measurements I'm convinced that the remaining overhead is caused by cache lookup and management. If I use a primitive cache which stores just a single transform, I get speeds that lie practically on top of the scipy data points everywhere.
Not sure what to make of that...
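The single-entry cache amounts to something like this (a sketch; it trades hit rate across sizes for a single branch-and-compare instead of dict lookup and eviction bookkeeping):

```python
_last = (None, None)  # (length, plan) of the most recent transform

def get_plan(n, build_plan):
    global _last
    if _last[0] != n:                # miss: rebuild and overwrite the single slot
        _last = (n, build_plan(n))
    return _last[1]
```

This is fastest for the common benchmark pattern of repeating one transform size, which may explain why it lands on top of the scipy data points.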

@mreineck (Contributor, Author) commented Sep 7, 2018

This is what I get with the latest patch. Please keep in mind that this is not directly comparable to earlier performance plots, because they were measured on a different computer!

[plots: speed_complex, speed_real]

@ncgeib commented Sep 7, 2018

I can confirm these results on my machine. In the latest state of the PR the runtime of pocketfft is basically the same as scipy's fftpack for regular sizes.

@mreineck (Contributor, Author) commented Sep 7, 2018

Great!
So, what's the procedure from here on?

All Travis builds fail (even those for Python 3), because C89 is enforced. Do we need to request a change in policy here?

@charris (Member) commented Sep 7, 2018

We will try to move to C99 after Python 2.7 is dropped. Note that other parts of NumPy will need some minor fixes, or at least that was my impression last time I tried with the C99 flag.

@charris added this to the 1.17.0 release milestone Sep 7, 2018
@charris (Member) commented Sep 7, 2018

I'm thinking of branching 1.16.0 around the end of November.

@charris (Member) commented Sep 7, 2018

You can try changing the build/test environment in this PR to see what needs doing, but it would be best to keep those changes as separate commits as we might want to split the resulting PR when we move up. In particular, you can drop the 2.7 tests and look to making the rest of NumPy C99 safe. If some of that can be done without breaking 2.7 on windows, so much the better.

charris added a commit to charris/numpy that referenced this pull request Sep 7, 2018
[skip appveyor]
C99 looks to work with gcc for all the Python versions we currently
support. Testing with c99 is a step forward in testing numpy#11888 before
it gets merged after we drop Python 2.7 support.
@charris (Member) commented Sep 7, 2018

I've put up #11905 to start testing with c99 on travisCI.

@charris (Member) commented Dec 25, 2018

Looks like we need to modify lines 344-352 in runtests.py to remove -Werror=declaration-after-statement and add -std=c99. I'll make a PR.

@charris (Member) commented Dec 25, 2018

See #12610.

@mhvk (Contributor) commented Dec 25, 2018

This all looks great, but does need a release note!

Also, having not looked in any detail: the old fft could not do single precision; can pocketfft? Important for large but noisy data sets. (But obviously no reason to hold up the PR; if it were possible to implement easily, however, maybe raise an issue so it can be done later.)

@charris (Member) commented Dec 25, 2018

@mreineck I've made a separate PR for the travis-test.sh testing fix, after that goes in you will want to rebase, probably as part of adding a release note.

@tacaswell (Contributor) commented:

I think matplotlib did that way back when and had some problems, although I'm not really familiar with that package. @tacaswell Are you folks still using C++?

Yes, we still have a bunch of C++ (mostly because Agg is templated C++).

@charris (Member) commented Dec 25, 2018

@mreineck Might also want to squash the 84 commits :)

@mreineck (Contributor, Author) commented:

@mreineck I've made a separate PR for the travis-test.sh testing fix, after that goes in you will want to rebase, probably as part of adding a release note.

What would that look like? Some addition in doc/changelog/1.17.0-changelog.rst?

I'll fetch the latest changes from master and squash as soon as the tester runs through.

@mreineck (Contributor, Author) commented:

Also, having not looked in any detail: the old fft could not do single precision; can pocketfft?

Not at the moment. I think it could be added, but I don't really have the experience to judge where we have to keep double precision to avoid losing too much accuracy. If memory usage is the concern, people will be better off with an in-place implementation that computes twiddle factors on the fly. Performance-wise I don't expect any speedup potential unless explicit vectorization support is added.

@charris (Member) commented Dec 25, 2018

@mreineck Put the notes in doc/release/1.17.0-notes.rst under "Improvements". I think a one-line list entry in "Highlights" would also be appropriate, i.e., * Blah, blah....

@mhvk (Contributor) commented Dec 25, 2018

@mreineck - with single precision, all calculations would be single precision; in the old fftpack, there was

#define DOUBLE
#ifdef DOUBLE
#define Treal double
#else
#define Treal float
#endif

and then all code is defined in terms of Treal. I think if one were to add this, one would handle the code duplication using the machinery also used for ufuncs. But that is for another PR! (Perhaps as part of making the fft routines generalized ufuncs...)

@mreineck (Contributor, Author) commented:

Simply replacing double by float will have very undesirable consequences for twiddle factor computation. But I agree, that is for another time...

@mreineck (Contributor, Author) commented:

OK, I think I'm done.

@charris (Member) commented Dec 25, 2018

OK, let's get this in. I haven't reviewed the sources, but I think the FFT is not difficult to test, so will rely on the tests and user feedback leading up to the 1.17.0 release. Thanks @mreineck .

@charris (Member) commented Dec 25, 2018

@mreineck We should also add some benchmarks for this. There are currently some in scipy that can probably be modified to cover things like large(ish) prime sizes.

@charris merged commit bae3cfa into numpy:master Dec 25, 2018
@rgommers changed the title from "WIP: Add pocketfft sources to numpy for testing, benchmarks, etc." to "ENH: Add pocketfft sources to numpy for testing, benchmarks, etc." Dec 27, 2018
@mreineck (Contributor, Author) commented:

@charris Thanks a lot for merging!

I agree that correctness testing for FFTs is not hard to do. The previously existing unit tests have been kept, and I added identity tests (real and complex) for all lengths between 1 and 512. That should catch most potential problems.

Concerning benchmarks, I think @ncgeib's work (#11888 (comment)) would be the best start.
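The identity tests described above boil down to round-trip checks; a sketch of the idea with pytest-style assertions (the test name is illustrative, not the one used in the PR):

```python
import numpy as np

def test_fft_roundtrip_identity():
    rng = np.random.RandomState(1234)
    for n in range(1, 513):
        # complex round trip: ifft(fft(x)) should recover x
        x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
        assert np.allclose(np.fft.ifft(np.fft.fft(x)), x, atol=1e-10)
        # real round trip: irfft needs the original length for odd n
        xr = rng.standard_normal(n)
        assert np.allclose(np.fft.irfft(np.fft.rfft(xr), n), xr, atol=1e-10)
```

Sweeping every length from 1 to 512 exercises all the small radix kernels as well as the prime-size code path.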
