Improve fbp speed #1505

adler-j · 2019-06-04T11:49:34Z

It seems that pyfftw real to complex FFT is not optimised for non-powers of two. This fix increases performance for 1000 angles and 500 pixels from 180ms to 3ms on my laptop.

kohr-h

I like the general idea, but the next power of 2 is not necessarily optimal. In particular pyfftw has a utility function next_fast_len that computes the optimal size. The scipy.fftpack version (though not supported by us) has an analogous function. Perhaps Numpy's FFT has the same characteristics, but since it's slower than pyFFTW anyway, it's probably okay to not optimize it.

So I suggest that you use pyfftw.next_fast_len for pyfftw backend, and either do nothing for numpy or guess and use the same.

adler-j · 2019-06-06T15:56:11Z

Ok, fix for that coming along. Sadly it seems that the function you mentioned is not 100% valid for e.g. rfft so I need to special case that. This function demonstrates the results anyway with that fix:

import time
import numpy as np
import matplotlib.pyplot as plt
import odl
import pyfftw

class Timer:
    def __init__(self, print_at_exit=True):
        self.print_at_exit = print_at_exit

    def __enter__(self):
        self.start_time = time.time()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.end_time = time.time()
        self.exc_type = exc_type
        if self.print_at_exit:
            print(self)

    def __repr__(self):
        return "<{} duration={}{}>".format(self.__class__.__name__, self.duration, self.exception())

    def exception(self):
        return " exception={}".format(self.exc_type.__name__) if self.exc_type else ""

    @property
    def duration(self):
        try:
            return self.end_time - self.start_time
        except AttributeError:
            return -1


impl = 'numpy'
pad = True

if impl == 'pyfftw':
    fft = pyfftw.interfaces.numpy_fft
elif impl == 'numpy':
    fft = np.fft


nv = np.arange(10, 512)
times = np.zeros_like(nv, dtype='float')

for i, n in enumerate(nv):
    if pad:
        npad = odl.trafos.util.next_fast_len(int(n), impl=impl, rfft=True)
    else:
        npad = n
    arr = np.zeros([npad, npad])
    with Timer() as timer:
        fft.irfft(fft.rfft(arr))
    times[i] = timer.duration

plt.semilogy(nv, times, label=impl + '_rfft_' + str(pad))

times = np.zeros_like(nv, dtype='float')

for i, n in enumerate(nv):
    if pad:
        npad = odl.trafos.util.next_fast_len(int(n), impl=impl, rfft=False)
    else:
        npad = n
    arr = np.zeros([npad, npad])
    with Timer() as timer:
        fft.ifft(fft.fft(arr))
    times[i] = timer.duration

plt.semilogy(nv, times, label=impl + '_fft_' + str(pad))

plt.legend()

kohr-h · 2019-06-07T08:13:19Z

Very interesting, thanks for the analysis @adler-j! As it turns out, SciPy's function was first and was ported to pyFFTW later: pyFFTW/pyFFTW#159. I see quite some benchmarking going on there, too, but it's possible that only C2C was covered.

Might be good to make them aware?

kohr-h

Some stuff

kohr-h · 2019-06-07T08:15:28Z

odl/util/utility.py

@@ -1607,6 +1608,28 @@ def unique(seq):
        return unique_values


+def nextpow2(n):
+    """
+    Compute the integer which is a power of two.


On first line.
An maybe: ~~"Return the next integer power of 2."~~
Actually, after the change, it's more like "Return the exponent of the next integer power of 2."

kohr-h · 2019-06-07T08:17:26Z

odl/util/utility.py

+    Parameters
+    ----------
+    n : int
+


No text? Not that there's a lot to say, but doc formatting might break.

kohr-h · 2019-06-07T08:19:28Z

odl/util/utility.py

+    >>> odl.util.nextpow2(7)
+    8
+    >>> odl.util.nextpow2(513)
+    1024


Doctests ought to fail now

kohr-h · 2019-06-07T08:20:26Z

odl/util/utility.py

@@ -1607,6 +1608,28 @@ def unique(seq):
        return unique_values


+def nextpow2(n):


next_int_log2?

kohr-h · 2019-06-07T08:21:05Z

odl/util/utility.py

@@ -47,6 +47,7 @@
    'repr_string',
    'attribute_repr_string',
    'method_repr_string',
+    'nextpow2',


Although tiny, I'd probably put this into numerics.py, if only to avoid littering this kitchen sink file.

kohr-h · 2019-06-07T08:24:31Z

odl/trafos/util/ft_utils.py

@@ -649,6 +650,19 @@ def reciprocal_space(space, axes=None, halfcomplex=False, shift=True,
    return recip_spc


+def next_fast_len(n, impl, rfft=False):
+    """The smallest size for which a fast fft implementation is available."""


kohr-h · 2019-06-07T08:25:12Z

odl/trafos/util/ft_utils.py

@@ -649,6 +650,19 @@ def reciprocal_space(space, axes=None, halfcomplex=False, shift=True,
    return recip_spc


+def next_fast_len(n, impl, rfft=False):


Maybe is_rfft

grlee77 · 2019-06-07T12:52:43Z

Ok, fix for that coming along. Sadly it seems that the function you mentioned is not 100% valid for e.g. rfft so I need to special case that. This function demonstrates the results anyway with that fix:

Do you mean that it isn't 100% valid in SciPy? In FFTW, I am pretty sure it is equally valid for both fft and rfft functions. I think the only case where it doesn't apply are a subset of the real-to-real transforms (e.g. DCT-I, which are not currently wrapped in pyFFTW).

Here is a figure I generated based on your script but calling

npad = pyfftw.interfaces.scipy_fftpack.next_fast_len(int(n))

instead of

npad = odl.trafos.util.next_fast_len(int(n), impl=impl, rfft=True)

Also, if you enable caching of plans and increase the planner effort and number of threads, the difference can become larger. Here is a case using 'FFTW_MEASURE' and 10 threads on 10-core SkylakeX CPU. In this case you need to request this near the top of the benchmark script

pyfftw.config.NUM_THREADS = 10
pyfftw.config.PLANNER_EFFORT = 'FFTW_MEASURE'
pyfftw.interfaces.cache.enable()

and make sure to call the transform once before entering the Timer so the cached Plan will be used

        # call once so the FFTW plan will be cached
        fft.irfft(fft.rfft(arr))

        # average time over 10 calls
        nreps = 10
        with Timer() as timer:
            for n in range(nreps):
                fft.irfft(fft.rfft(arr))
        times[i] = timer.duration / nreps

grlee77 · 2019-06-07T13:01:18Z

Related to this, the following work might eventually remove the need to maintain multiple FFT backends in ODL itself:

We currently have a student working on support for multiple FFT backends for SciPy as part of Google Summer of Code (GSOC). As of now it is only week 2 of 12 for the project, but the goal is to allow use of various backends (pyFFTW, pocketfft, mkl-fft, etc) via a unified API using a proposed scipy.fft module.
see: https://github.com/scipy/scipy/wiki/GSoC-2019-project-ideas
some initial work in: scipy/scipy#10238

kohr-h · 2019-06-07T14:14:19Z

Many thanks for the analysis @grlee77! Re: the new scipy.fft effort: That would make things so much easier! Looking forward to trying it out and simplifying our code!

adler-j · 2019-06-09T09:49:59Z

It should be mentioned that there is already the pyfftw numpy compatibility interfaces: https://hgomersall.github.io/pyFFTW/pyfftw/interfaces/interfaces.html the problem with them was that they do not provide optimal usage in terms of e.g. caching and planning. Is the planned scipy module supposed to solve this somehow?

kohr-h · 2019-06-09T10:58:19Z

To me it seems like the exposed interface capabilities will depend on the backend. The SciPy interface, for instance, has always had overwrite_x as option, although I don't know how it affects performance. So I guess the final interface will look identical in the first couple of (positional) arguments and add optional args depending on the backend. Ideally with all the caching and in-place options supported.

But @grlee77 should know better.

grlee77 · 2019-06-12T13:46:10Z

It should be mentioned that there is already the pyfftw numpy compatibility interfaces: https://hgomersall.github.io/pyFFTW/pyfftw/interfaces/interfaces.html the problem with them was that they do not provide optimal usage in terms of e.g. caching and planning.

pyFFTW 0.11 added the ability to configure the default number of threads and planner effort without having to explicitly pass the number of threads and effort to each individual function call (this was used in my example above). When the cache is enabled, I think this should provide most of the performance one would get when explicitly preplanning transforms.

Is the planned scipy module supposed to solve this somehow?

There are still open questions as to what the final form of this will be and whether backend-specific arguments will be supported, so any feedback you have as a potential downstream user would be greatly appreciated. The issue I linked above was mainly for the pocketfft work, but some initial backend discussion has been in scipy/scipy#10175 and some earlier posts on the scipy-dev mailing list. There is not yet an open PR for the backend support (Peter started out with the addition of pypocketfft to SciPy and is just now starting to implement the backends). Peter prepared a design document with initial plans that will be revised as needed going forward. At the moment we are exploring whether or not we should use uarray to implement the backend.

The SciPy interface, for instance, has always had overwrite_x as option, although I don't know how it affects performance.

It can help some, but has also caused some confusion on the user end because setting it to True is not a guarantee that the array will be overwritten (this is only possible for some transform types). So the user should not try to use it in the same way as the out= argument provided by many numpy functions (it is basically like FFTW's FFTW_DESTROY_INPUT flag).

adler-j added 2 commits June 4, 2019 12:46

ENH: Use nextpow2 for faster FBP

3569935

ENH: Improve speed of FBP filter by nextpow2

6c93628

adler-j requested a review from kohr-h June 4, 2019 12:07

kohr-h requested changes Jun 6, 2019

View reviewed changes

MAINT: Change nextpow2 to standard definition

53322ea

Improve speed of fbp by fastest padding

152dab2

kohr-h requested changes Jun 7, 2019

View reviewed changes

kohr-h added this to In revision in PR Status Jun 11, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve fbp speed #1505

Improve fbp speed #1505

adler-j commented Jun 4, 2019

kohr-h left a comment

adler-j commented Jun 6, 2019

kohr-h commented Jun 7, 2019

kohr-h left a comment

kohr-h Jun 7, 2019

kohr-h Jun 7, 2019

kohr-h Jun 7, 2019

kohr-h Jun 7, 2019

kohr-h Jun 7, 2019

kohr-h Jun 7, 2019

kohr-h Jun 7, 2019

grlee77 commented Jun 7, 2019

grlee77 commented Jun 7, 2019

kohr-h commented Jun 7, 2019

adler-j commented Jun 9, 2019

kohr-h commented Jun 9, 2019

grlee77 commented Jun 12, 2019

		@@ -1607,6 +1608,28 @@ def unique(seq):
		return unique_values


		def nextpow2(n):

		@@ -649,6 +650,19 @@ def reciprocal_space(space, axes=None, halfcomplex=False, shift=True,
		return recip_spc


		def next_fast_len(n, impl, rfft=False):

Improve fbp speed #1505

Are you sure you want to change the base?

Improve fbp speed #1505

Conversation

adler-j commented Jun 4, 2019

kohr-h left a comment

Choose a reason for hiding this comment

adler-j commented Jun 6, 2019

kohr-h commented Jun 7, 2019

kohr-h left a comment

Choose a reason for hiding this comment

kohr-h Jun 7, 2019

Choose a reason for hiding this comment

kohr-h Jun 7, 2019

Choose a reason for hiding this comment

kohr-h Jun 7, 2019

Choose a reason for hiding this comment

kohr-h Jun 7, 2019

Choose a reason for hiding this comment

kohr-h Jun 7, 2019

Choose a reason for hiding this comment

kohr-h Jun 7, 2019

Choose a reason for hiding this comment

kohr-h Jun 7, 2019

Choose a reason for hiding this comment

grlee77 commented Jun 7, 2019

grlee77 commented Jun 7, 2019

kohr-h commented Jun 7, 2019

adler-j commented Jun 9, 2019

kohr-h commented Jun 9, 2019

grlee77 commented Jun 12, 2019