
Speeding up the computation of sliding dot product with fft #938

Open
NimaSarajpoor opened this issue Dec 15, 2023 · 90 comments
Labels: enhancement (New feature or request), refactor

Comments

NimaSarajpoor (Collaborator) commented Dec 15, 2023

[Update]
Here, I am providing some important links to external resources or the comments mentioned here:


Currently, in STUMPY, the sliding dot product of a query Q and a time series T is computed via one of the following two functions:

  • core.sliding_dot_product, which takes advantage of the FFT trick via scipy.signal.convolve
  • core._sliding_dot_product, which uses njit on top of np.dot

The sliding dot product in MATLAB (via the FFT trick) seems to be faster, though.

% MATLAB code

% x is the data, y is the query
m = length(y);
n = length(x);

y = y(end:-1:1); % reverse the query
y(m+1:n) = 0; % append zeros

% The main trick of getting dot products in O(n log n) time
X = fft(x);
Y = fft(y);
Z = X.*Y;
z = ifft(Z);

% and then use the slice z(m:n)

Can we get closer to the performance of MATLAB?
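
For reference, a direct NumPy translation of the MATLAB snippet above (just a sketch for comparison purposes, not the STUMPY implementation) could look like this:

import numpy as np

def sliding_dot_product_fft(Q, T):
    # Reverse the query, zero-pad it to len(T), multiply the FFTs, and invert;
    # z[m - 1 : n] (0-based) corresponds to MATLAB's z(m:n)
    m, n = len(Q), len(T)
    Qr = np.zeros(n, dtype=np.float64)
    Qr[:m] = Q[::-1]
    z = np.fft.ifft(np.fft.fft(T) * np.fft.fft(Qr))
    return np.real(z[m - 1 : n])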

seanlaw (Contributor) commented Dec 18, 2023

@NimaSarajpoor I just noticed that scipy.fft.rfft has a parameter called workers, which can perform FFT in parallel. I wonder if you could try that and see if it makes any difference?

seanlaw (Contributor) commented Dec 20, 2023

@NimaSarajpoor I came across a more recent FFT implementation called OTFFT that claims to be faster than FFTW and has a more generous MIT license. However, I tried to implement the basic fft function in Python but haven't been able to get the same answer as scipy.fft.fft. Here's what I did (List-7: Final version of the Stockham Algorithm):

import math
import cmath
import numpy as np

def fft0(n, s, eo, x, y):
    if not math.log2(n).is_integer():  # Check if n is power of 2
        pass
    
    m = n // 2
    theta0 = 2 * math.pi / n
    if n == 1:
        if eo:
            for q in range(s):
                y[q] = x[q]
    else:
        for p in range(m):
            wp = complex(math.cos(p*theta0), -math.sin(p*theta0))
            for q in range(s):
                a = complex(x[q + s*(p + 0)])
                b = complex(x[q + s*(p + m)])
                y[q + s*(2*p + 0)] = a + b
                y[q + s*(2*p + 1)] = (a - b) * wp
        fft0(n//2, 2*s, not eo, y, x)


def fft(n, x):
    y = np.empty(n, dtype=complex)
    fft0(n, 1, False, x, y)
    for k in range(n):
        x[k] /= n

Would you mind taking a look? Maybe I messed up somewhere but I've been staring at it for too long and I'm not able to spot anything. Thanks in advance!

NimaSarajpoor (Collaborator, Author) commented Dec 20, 2023

@seanlaw

I came across a more recent FFT implementation called OTFFT that claims to be faster than FFTW and has a more generous MIT license

Cool!

Would you mind taking a look?

Sure! Will take a look.


Also:
I have been trying scipy.fft.rfft / scipy.fft.fft. Also, as you mentioned before, I am using different number of workers, 1 vs os.cpu_count(). Haven't seen any improvement yet compared to stumpy.core.sliding_dot_product.

According to the scipy doc:

The workers argument specifies the maximum number of parallel jobs to split the FFT computation into. This will execute independent 1-D FFTs within x. So, x must be at least 2-D and the non-transformed axes must be large enough to split into chunks. If x is too small, fewer jobs may be used than requested.

I will test again and share the result and code for our future reference.

NimaSarajpoor (Collaborator, Author):

Also: I have been trying scipy.fft.rfft / scipy.fft.fft. Also, as you mentioned before, I am using different number of workers, 1 vs os.cpu_count(). Haven't seen any improvement yet compared to stumpy.core.sliding_dot_product.

According to the scipy doc:

The workers argument specifies the maximum number of parallel jobs to split the FFT computation into. This will execute independent 1-D FFTs within x. So, x must be at least 2-D and the non-transformed axes must be large enough to split into chunks. If x is too small, fewer jobs may be used than requested.

I will test again and share the result and code for our future reference.

Currently, core.sliding_dot_product uses the scipy.signal.convolve function. In what follows, the performance of core.sliding_dot_product is compared with some alternatives.

sliding_dot_product_v0 = core.sliding_dot_product

def sliding_dot_product_v1(Q, T):  
    n = len(T)
    X = scipy.fft.rfft(T) * scipy.fft.rfft(np.flipud(Q), n=n)
    out = scipy.fft.irfft(X, n=n)
    
    return out[len(Q) - 1 :]


def sliding_dot_product_v2(Q, T):
    n = len(T)
    X = scipy.fft.rfft(T, workers=8) * scipy.fft.rfft(np.flipud(Q), n=n, workers=8)
    out = scipy.fft.irfft(X, n=n, workers=8)
    
    return out[len(Q) - 1 :]


def sliding_dot_product_v3(Q, T):
    n = len(T)
    X = np.fft.rfft(T) * np.fft.rfft(np.flipud(Q), n=n)
    out = np.fft.irfft(X, n=n)

    return out[len(Q) - 1 :]

And, this is the code for tracking the running time for different window sizes

n = 1_000_000
data = np.array(loadmat('./DAMP_data/mit_long_term_ecg14046.mat')['mit_long_term_ecg_14046'][0]).astype(np.float64)
T = data[:n]

t = []
for m in range(3, 5000):
    Q = T[:m]
   
    t1 = time.time()
    comp = sliding_dot_product_function(Q, T)
    t2 = time.time()

    t.append(t2 - t1)    
image

As observed:

  • The functions sliding_dot_product_v1 and sliding_dot_product_v2 both use scipy rfft and are identical except for the number of workers. As expected, their performance is close, since workers only affects performance when the input is 2-D (and here the input is 1-D); see the sketch below.

  • The performance of sliding_dot_product_v3 (using numpy rfft) is close to the existing version sliding_dot_product_v0.
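
To illustrate the workers point (a minimal sketch with an assumed 2-D batch; not STUMPY code), workers can only help when there are multiple independent 1-D transforms to split across threads:

import numpy as np
import scipy.fft

# 64 independent rows: the FFT is applied along the last axis and the rows
# can be split across threads; a single 1-D input has nothing to split
batch = np.random.rand(64, 2 ** 16)
X_serial = scipy.fft.rfft(batch, axis=-1, workers=1)
X_parallel = scipy.fft.rfft(batch, axis=-1, workers=8)
np.testing.assert_allclose(X_serial, X_parallel)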

seanlaw (Contributor) commented Dec 24, 2023

Thanks @NimaSarajpoor. In case it matters (and if you're not already doing this), it would make sense to test window sizes and/or time series lengths in powers of 2 rather than increments of 1.

NimaSarajpoor (Collaborator, Author) commented Dec 24, 2023

Thanks @NimaSarajpoor. In case it matters (and if you're not already doing this), it would make sense to test window sizes and/or time series lengths in powers of 2 rather than increments of 1.

image

[Note]
According to the source code of scipy.fft.rfft, we can use the parameter n to pad the input Q with zeros so that its length matches the length of T. So, if the length of T is a power of two, I think the length of the query does not need to be a power of two. I am going to provide the performance of the sliding dot product functions for queries with lengths in range(4, 1025):

image

[Update] Correction: the x-axis label of the bottom figure should read "the length of the query".
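
As a quick illustration of the padding point (a small sketch with made-up lengths), only len(T) needs to be a power of two; the query is zero-padded to n inside rfft:

import numpy as np
import scipy.fft

T = np.random.rand(2 ** 10)   # power-of-two length
Q = T[:37]                    # the query length is not a power of two
n, m = len(T), len(Q)

X = scipy.fft.rfft(T) * scipy.fft.rfft(np.flipud(Q), n=n)  # Q is zero-padded to n
QT = scipy.fft.irfft(X, n=n)[m - 1 :]

# compare against the brute-force sliding dot product
ref = np.array([Q @ T[i : i + m] for i in range(n - m + 1)])
np.testing.assert_allclose(QT, ref)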

NimaSarajpoor (Collaborator, Author) commented Dec 25, 2023

@seanlaw

@NimaSarajpoor I came across a more recent FFT implementation called OTFFT that claims to be faster than FFTW and has a more generous MIT license. However, I tried to implement the basic fft function in Python but haven't been able to get the same answer as scipy.fft.fft. Here's what I did (List-7: Final version of the Stockham Algorithm):

import math
import cmath
import numpy as np

def fft0(n, s, eo, x, y):
    if not math.log2(n).is_integer():  # Check if n is power of 2
        pass
    
    m = n // 2
    theta0 = 2 * math.pi / n
    if n == 1:
        if eo:
            for q in range(s):
                y[q] = x[q]
    else:
        for p in range(m):
            wp = complex(math.cos(p*theta0), -math.sin(p*theta0))
            for q in range(s):
                a = complex(x[q + s*(p + 0)])
                b = complex(x[q + s*(p + m)])
                y[q + s*(2*p + 0)] = a + b
                y[q + s*(2*p + 1)] = (a - b) * wp
        fft0(n//2, 2*s, not eo, y, x)


def fft(n, x):
    y = np.empty(n, dtype=complex)
    fft0(n, 1, False, x, y)
    for k in range(n):
        x[k] /= n

Would you mind taking a look? Maybe I messed up somewhere but I've been staring at it for too long and I'm not able to spot anything. Thanks in advance!

It turns out that x holds the correct output if we simply avoid dividing it by n.

def fft0(n, s, eo, x, y):
    if not math.log2(n).is_integer():  # Check if n is power of 2
        pass
    
    m = n // 2
    theta0 = 2 * math.pi / n
    if n == 1:
        if eo:
            for q in range(s):
                y[q] = x[q]
    else:
        for p in range(m):
            wp = complex(math.cos(p*theta0), -math.sin(p*theta0))
            for q in range(s):
                a = complex(x[q + s*(p + 0)])
                b = complex(x[q + s*(p + m)])
                y[q + s*(2*p + 0)] = a + b
                y[q + s*(2*p + 1)] = (a - b) * wp
        fft0(n//2, 2*s, not eo, y, x)


# I swapped the params of the `fft` function to make its signature similar to `scipy.fft.fft`
def fft(x, n):  
    y = np.empty(n, dtype=complex)
    fft0(n, 1, False, x, y)
    
    return x

And, to test it:

for power in range(1, 10):
    n = 2 ** power
    x = np.random.rand(n).astype(complex)
   
    ref = scipy.fft.fft(x)
    np.testing.assert_almost_equal(ref, fft(x, n))

seanlaw (Contributor) commented Dec 25, 2023

It turns out that x holds the correct output if we simply avoid dividing it by n.

Hmmm, I wonder why they performed the division?! Thanks for figuring it out. I just ported it over blindly without trying to understand it 🤣.

How about the ifft?

def ifft(n, x):
    for p in range(n):
        x[p] = x[p].conjugate()
    y = np.empty(n, dtype=complex)
    fft0(n, 1, False, x, y)
    # for k in range(n):
    #     x[k] = x[k].conjugate()

This doesn't seem to match scipy.fft.ifft either.

Would you mind doing a performance comparison if you are able to crack this?

NimaSarajpoor (Collaborator, Author) commented Dec 25, 2023

@seanlaw

It turns out that x holds the correct output if we simply avoid dividing it by n.

Hmmm, I wonder why they performed the division?! Thanks for figuring it out. I just ported it over blindly without trying to understand it 🤣.

How about the ifft?

def ifft(n, x):
    for p in range(n):
        x[p] = x[p].conjugate()
    y = np.empty(n, dtype=complex)
    fft0(n, 1, False, x, y)
    # for k in range(n):
    #     x[k] = x[k].conjugate()

This doesn't seem to match scipy.fft.ifft either.

Would you mind doing a performance comparison if you are able to crack this?

This should work:

def _ifft(x):
    n = len(x)   # assuming `n` is power of two 
    x[:] = np.conjugate(x)
    y = np.empty(n, dtype=np.complex128)
    fft0(n, 1, False, x, y)
    
    return np.conjugate(x / n)
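
As a quick sanity check (assuming the fft0 and _ifft functions defined above), we can mirror the earlier fft test:

import numpy as np
import scipy.fft

for power in range(1, 10):
    n = 2 ** power
    x = np.random.rand(n).astype(complex)

    ref = scipy.fft.ifft(x)
    np.testing.assert_almost_equal(ref, _ifft(x))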

I am working on some minor changes to boost the performance. I will share the performance of both original version, and the enhanced version, and will compare them with the core.sliding_dot_product.

NimaSarajpoor (Collaborator, Author) commented Dec 25, 2023

For now, I did some enhancements on the new fft / ifft functions suggested in #938 (comment).


Part (I):
I show the performance of five versions against the performance of our reference version, i.e. core.sliding_dot_product. The output of each version is tested to make sure that it works correctly. The length of the time series T is $2^{15}$, and the length of the query is in range(10, 1000 + 10, 10).

Here is a description of each version:

v0 --> use the new functions fft and ifft 

v1 --> v0 + Reused the already-allocated memory `y`

v2 --> v1 +  Converted the inner for-loop of `fft` function to a numpy vectorized operation 

v3 -->  v2 + Added njit decorator with `fastmath=True`

v4 --> v3 + Parallelized the outer for-loop of `fft` function
image

For a time series with length $2^{15}$, it seems that there is not much difference between v0 and v1. The changes in v2 and v3 seem to be very effective. How about v4? To better demonstrate its impact, in figure below, I am showing the performance of v3, v4, and the reference only.

image

And, we can zoom in further by removing the v3 from the figure. Then, we will see:

image

Part (II):
We now show how v4 performs against the ref (i.e. core.sliding_dot_product) for different lengths of the time series T.

image

As observed, the gap in the performance becomes bigger as the length of time series increases.


The code is available in the notebook pushed to this PR #939.

Next steps:
(1) Make the code cleaner.
(2) Profile the function to see where that increase in the performance gap comes from.
(3) Optimize accordingly.

@seanlaw
What do you think?
Also: If I need to add/revise a step, please let me know.

seanlaw (Contributor) commented Dec 26, 2023

If I understand correctly, the Stockham algorithm is NOT faster than scipy.signal.convolve (or they are about the same after some optimizations). Is that correct? And it also means that the Stockham algo is much slower than FFTW?

I wonder if there might be some clues in the OTFFT source code in terms of how they might have parallelized it using OpenMP. I'd be pretty happy if we could get within 2x (slower) than FFTW. I'm also guessing that the sawtooth shape observed in scipy.signal.convolve likely comes from switching between two FFT algorithms. It's reassuring to see that OTFFT is pretty stable in performance across different distances. I'm confused as to why using prange wouldn't give you the necessary speedup but that also depends on the hardware that you are using. Maybe I can find some time to test it out on my Mac with many threads.

NimaSarajpoor (Collaborator, Author):

If I understand correctly, the Stockham algorithm is NOT faster than scipy.signal.convolve (or they are about the same after some optimizations). Is that correct? And it also means that the Stockham algo is much slower than FFTW?

Yes. After some initial optimizations, I can see that the Stockham algorithm (from OTFFT) is slower. So, as you mentioned:

  • The Python Stockham algorithm (from OTFFT) is slower than scipy.signal.convolve.
  • scipy.signal.convolve is slower than MATLAB's FFTW.

I wonder if there might be some clues in the OTFFT source code in terms of how they might have parallelized it using OpenMP. I'd be pretty happy if we could get within 2x (slower) than FFTW.

I will try to go through it to get some idea. We need to run the tests on the MATLAB online server if we want to consider MATLAB's FFTW as our benchmark.

I'm confused as to why using prange wouldn't give you the necessary speedup but that also depends on the hardware that you are using.

I think it gave us some boost. See the figure below...
v4 is the same as v3 except that it uses prange.

image

numba.get_num_threads() is 8 on my macOS system.

seanlaw (Contributor) commented Dec 27, 2023

It appears that maybe we should consider implementing the six-step or eight-step FFT algorithm next, as it should have much better memory locality and is therefore "optimized". I'd expect this to be faster than our current sliding dot product. I'm not sure how/if any of the radix variants (shown at the same link above) will help.

NimaSarajpoor (Collaborator, Author) commented Dec 27, 2023

we should consider implementing the six-step or eight-step FFT algorithm next as it should have much better memory locality and is therefore "optimized"

I have implemented six-step-FFT. I will work on eight-step FFT algorithm.

[Note to self]
Before I forget, here are a couple of notes that I may need to revisit later:

  • numpy vectorized operation seems to increase the running time (?!)
  • Turning off parallelization may speed up the computation (?!)

seanlaw (Contributor) commented Dec 27, 2023

I have implemented six-step-FFT. I will work on eight-step FFT algorithm.

In case it matters, section 6.3 might be relevant as it discusses Stockham and the 6-step algo. More importantly, it describes how/why cache memory is important

NimaSarajpoor (Collaborator, Author) commented Dec 28, 2023

According to this Mathworks webpage, MATLAB is equipped with built-in multithreading.

The built-in multithreading feature in MATLAB automatically parallelizes computations using underlying libraries at runtime, resulting in significant performance improvements for linear algebra and numerical functions such as fft, eig, mldivide, svd, and sort. Since MATLAB does this implicitly, you do not need to enable it manually. If you would like to control the number of computational threads, you can use the maxNumCompThreads function.

A few notes:
(1) As suggested, we can use maxNumCompThreads to set the number of threads to 1, and then compare the MATLAB code (for the sliding dot product) with stumpy.core.sliding_dot_product. We use the MATLAB online server.

(2) We can implement the 6-step / 8-step FFT, parallelize it, and then compare it with the MATLAB code, again on the MATLAB online server.

(3) Should we expect the 6-step and 8-step implementations (without multithreading) to be faster than scipy.fft.fft? I ask because scipy.fft.fft is written in C++, and I wonder whether it makes sense to expect it to be outperformed by a 6-step / 8-step FFT written in Python. If not, then we can just stick to (2) and see how it performs compared to the MATLAB code.

seanlaw (Contributor) commented Dec 29, 2023

should we expect the 6-step and 8-step implementation (without multithreading) be faster than the scipy.fft.fft ?

Based on these performance results I expect the njit version of the 6/8 step algo to be faster than scipy.fft.fft and, possibly, as fast as FFTW.

NimaSarajpoor (Collaborator, Author):

[Update]
I implemented the 8-step algo as well. I ran it on MATLAB online server, and got this:

image

Based on these performance results I expect the njit version of the 6/8 step algo to be faster than scipy.fft.fft and, possibly, as fast as FFTW.

I will go through section 6.3 and check the algo again to better understand it, and see if I am doing something wrong in my code. I may start with a notebook, and implement each function in one cell to go through them together with you, @seanlaw .

Also, need to take a look at the OTFFT source code

seanlaw (Contributor) commented Dec 30, 2023

I implemented the 8-step algo as well. I ran it on MATLAB online server, and got this:

This is 8-step with njit and parallel=True? Any idea if the 8-step is faster than scipy?

NimaSarajpoor (Collaborator, Author) commented Dec 30, 2023

This is 8-step with njit and parallel=True?

Yes, and yes

Any idea if the 8-step is faster than scipy?

I think scipy.fft is slightly faster than my current implementation of the 6-step / 8-step algorithm on my Mac.

Note that the 8-step algorithm is for lengths $2^{odd}$ and the 6-step algorithm is for lengths $2^{even}$ (btw, the former calls the latter twice). Instead of the sliding dot product, I am going to show the performance of the fft function itself. I tested time series T with power-of-two lengths from $2^{1}$ to $2^{24}$.

See the figure below. ref is scipy.fft.fft, and comp is our fft function (njit with parallel=True), which calls the 6-step algorithm when the power is even and the 8-step algorithm when the power is odd.

image

[Update-1]
After reducing number of redundant arithmetic operations...I got this:

image

[Update-2]
I made another minor enhancement. Also, I am now checking the performance of the FFT of T with length $2^{27}$. Furthermore, I now run each case 21 times and take the median as the running time.

image

seanlaw (Contributor) commented Dec 30, 2023

It's surprising that it jumps so dramatically from 2^24 to 2^27. I wonder if FFTW suffers from this too or if it still scales linearly. Technically, this algorithm should be n log n in computational complexity.

NimaSarajpoor (Collaborator, Author):

It's surprising that it jumps so dramatically from 2^24 to 2^27. I wonder if FFTW suffers from this too or if it still scales linearly. Technically, this algorithm should be n log n in computational complexity.

I do not know what is happening there. I wanted to try 2^28 but I noticed it takes too much time, so I just stopped at 2^27. I will check the behaviour of MATLAB's FFTW for large inputs.


As suggested by @seanlaw in #939 (comment), we are going to take advantage of rfft since the input is a real-valued data. Based on the algo provided in 2.6.2 and summarized in #939 (comment), I implemented an initial version of fft that takes advantage of rfft.

In the following figure, ref is scipy.fft.fft, and comp is a njit with parallel=True function that takes advantage of rfft

image

I will work on finding opportunities to improve the performance.


According to #938 (comment), we might be able to just use rfft for computing the sliding-dot-product (see function below). So, we do not need to construct the fft. We just need to implement the equivalent of np.fft.irfft

def sliding_dot_product_v3(Q, T):
    n = len(T)
    X = np.fft.rfft(T) * np.fft.rfft(np.flipud(Q), n=n)
    out = np.fft.irfft(X, n=n)

    return out[len(Q) - 1 :]

NimaSarajpoor (Collaborator, Author) commented Jan 2, 2024

Will check the behaviour of MATLAB FFTW for large-size inputs.

It seems that MATLAB FFTW shows similar behaviour! (see below)


Also, the following figures confirm that six-step / eight-step FFT outperform FFTW in MATLAB

image

Let's zoom in for log2(len(T)) in range(2, 16):
image

How about p in range(2, 21)?
image

MATLAB

% MATLAB
fft_running_time = [];
pmax = 27;

rng(1,"twister");
T = rand(1, 2^pmax);
save('T_input.mat', 'T');  % we will use this for the performance of our FFT

p_list = 2:1:pmax;
for p = p_list
    idx = 2^p;
    data = T(1, 1:idx);
 
    t = tic; 
    X = fft(data);
    fft_running_time(end+1) = toc(t);
end

Python

# Python
fft(np.random.rand(4), is_rfft=False, y=None) # dummy

fft_running_time_python = []
T = np.array(loadmat('./T_input.mat')['T'][0]).astype(np.float64)
p_list = range(2, 28)
for p in p_list:
    idx = 2 ** p
    data = T[:idx]
 
    tic = time.time()
    X = fft(data, is_rfft=False, y=None)
    toc = time.time()
    fft_running_time_python.append(toc - tic)

The fft function used here takes advantage of rfft but returns full-length fft at the end. The fft python code used for this exact performance comparison is provided here:

https://gist.github.com/NimaSarajpoor/5cf2d4f0b0aad5e0898651dddff35c17

@seanlaw
I created and pushed a new notebook to PR #939. I am now thinking of adding the functions from the link above to it, one by one. What do you think?

Maybe we just ignore the recent pushes to the new notebook, and I start over by adding one function at a time.

NimaSarajpoor (Collaborator, Author) commented Jan 2, 2024

In case it matters, I also checked the performance of scipy.fft.fft on the MATLAB online server for the same input data used in my previous comment.

image

seanlaw (Contributor) commented Jan 2, 2024

Also, the following figures confirm that six-step / eight-step FFT outperform FFTW in MATLAB

Amazing! Can you confirm whether MATLAB and 6/8-step Python are single threaded or multithreaded?

I created and pushed a new notebook to PR #939 , I am now thinking of adding functions from the link above to it, one by one. What do you think?

Please give me some time to review and respond

NimaSarajpoor (Collaborator, Author) commented Jan 3, 2024

Can you confirm whether MATLAB and 6/8-step Python are single threaded or multithreaded?

Before I provide some answers, I should mention that I got slightly different results!

Short answer: They were multithreaded.

Long answer: They were multithreaded. I think it is important to set the number of threads explicitly. (I am not sure whether I made a mistake before or the online server behaved differently, but I think we can trust the following results this time since I explicitly set the number of threads in both the MATLAB and Python code.)

To make sure I am not making any accidental mistake, I got the performance for three cases: N_THREADS in {1, 4, 8}

To manage threads:

N = 8  # NUMBER OF THREADS I WOULD LIKE TO USE

# in MATLAB
LASTN = maxNumCompThreads(N);

# in Python
import numba
numba.set_num_threads(N)

I first show the performance of FFT in MATLAB for the three cases:

image

And, now I show the performance of FFT in Python for the same three cases:

image

So, we can see that the number of threads affects the performance.


Okay, let's compare MATLAB vs. Python. To show this, I subtract the Python running time from the corresponding MATLAB running time. So a positive y-value means that the Python code is faster than its MATLAB counterpart.

image

It seems that as the number of threads increases, the six-step / eight-step FFT shows better performance!

Btw, let's zoom in for range(5, 21)

image

And MATLAB-vs-Python when we have 8 threads:

image

NimaSarajpoor (Collaborator, Author) commented Jan 4, 2024

Okay... I tried to check the performance again. I ran each case for 50 times and then took the average. The following cases are considered:

  • MATLAB fft
  • Python six-step / eight-step fft
  • Python six-step / eight-step rfft
  • Scipy fft
  • Scipy rfft

And, for each of the cases above, I considered different scenarios: 1 thread / 4 threads / 8 threads.

The results are provided below. I decided not to zoom in on each figure so that I can keep this comment easy to follow. In the following figures, 6fft refers to the 6-step / 8-step FFT.


MATLAB

image

Python vs MATLAB [with 1 thread]

image

When we have one thread only, scipy fft / rfft wins.

Python vs MATLAB [with 8 thread]

Note that scipy is not parallelized. But, just to be consistent, I added it to the figure for the sake of comparison.

image

SHOW the diff: Python vs MATLAB [with 8 thread]

In the following figure, the red plot shows MATLAB running time minus the fft running time; And, the blue plot shows MATLAB running time minus the 6fft running time

image

[Update]
And if I zoom into the figure above to better see the comparison for time series with length <= 2 ^ 20, I will get this:
image

which shows MATLAB FFT performs better when length is <= 2 ^ 20

seanlaw (Contributor) commented Jan 4, 2024

Thanks @NimaSarajpoor. Can you please provide the raw timing data in a table? I can see that we are doing better at longer time series lengths and with 8 threads. This is great. However, I think the majority of use cases will likely be in the 2^20 range (what do you think?) and so I'd like to see how close we are to MATLAB-FFTW. If we are several orders of magnitude off (in the <= 2^20 range) in timing then we still have work to do. If we are within 2x then I am okay with that.

NimaSarajpoor (Collaborator, Author) commented Mar 20, 2024

[WIP]

[Warning]
Previously, I mentioned that it is not clear how one can set the number of threads in MATLAB. In addition to that, I just discovered that the [average] performance of MATLAB code gets better as the number of iterations increases (generally speaking).

Examples:

image image image

(I found this when I was trying to re-run codes, and I used a higher number of iterations. I noticed that the performance ratio changed.)

seanlaw (Contributor) commented Mar 21, 2024

In addition to that, I just discovered that the [average] performance of MATLAB code gets better as the number of iterations increases (generally speaking).

So, if MATLAB uses FFTW then what might be happening is that MATLAB is creating FFTW "plans" in the background that allows it to improve the computation speed: https://www.fftw.org/fftw3_doc/Using-Plans.html

NimaSarajpoor (Collaborator, Author) commented Mar 21, 2024

In addition to that, I just discovered that the [average] performance of MATLAB code gets better as the number of iterations increases (generally speaking).

So, if MATLAB uses FFTW then what might be happening is that MATLAB is creating FFTW "plans" in the background that allows it to improve the computation speed: https://www.fftw.org/fftw3_doc/Using-Plans.html

I see! yeah... that might be the reason. So, I am now thinking of looking at MATLAB-vs-Python comparison from two perspectives:

(1) I can use median instead of average running time, and I think that should give me a somewhat stable value for MATLAB code's performance. I will try it out.

(2) I can just consider the running time of the first run. My understanding is that a plan for fft(T) cannot be used for computing the fft of a new time series data (to boost its performance). Since we often care about computing the fft(T) only once, it should be okay to just consider the running time of the first run.

NimaSarajpoor (Collaborator, Author) commented Apr 8, 2024

In addition to that, I just discovered that the [average] performance of MATLAB code gets better as the number of iterations increases (generally speaking).

So, if MATLAB uses FFTW then what might be happening is that MATLAB is creating FFTW "plans" in the background that allows it to improve the computation speed: https://www.fftw.org/fftw3_doc/Using-Plans.html

I see! yeah... that might be the reason. So, I am now thinking of looking at MATLAB-vs-Python comparison from two perspectives:

(1) I can use median instead of average running time, and I think that should give me a somewhat stable value for MATLAB code's performance. I will try it out.

(2) I can just consider the running time of the first run. My understanding is that a plan for fft(T) cannot be used for computing the fft of a new time series data (to boost its performance). Since we often care about computing the fft(T) only once, it should be okay to just consider the running time of the first run.

[WIP]


In the meantime, let's review the OTFFT 6-step FFT algorithm, which is for a time series T with length $N = 2^{2k}$. Setting n to $2^{k}$ (note that $n = \sqrt{N}$), the time series T can be viewed as n chunks, each of length n.

6-step FFT
Step 1: Transpose blocks, i.e. x[:] = x.reshape(n,n).T.reshape(-1,)
Step 2: Apply the function _fft0 on each block
Step 3: Multiply the elements by the twiddle factor $W^{pk}_{N} = e^{-j\theta}$, where $\theta = \frac{2\pi}{N} \times (p \times k)$, and p and k are the row / column indices of the element when x is viewed as an n-by-n matrix (i.e., the p and k for the twiddle factor of x[idx] are p = idx // n and k = idx % n)
Step 4: Transpose blocks, i.e. x[:] = x.reshape(n,n).T.reshape(-1,)
Step 5: Apply the function _fft0 on each block
Step 6: Transpose blocks, i.e. x[:] = x.reshape(n,n).T.reshape(-1,)
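
To keep the steps concrete, here is a plain NumPy sketch of this decomposition (it uses np.fft.fft for the per-block transforms, so it only illustrates the data flow, not the njit implementation):

import numpy as np

def six_step_fft_sketch(x):
    N = len(x)
    n = int(np.sqrt(N))  # assumes N = n * n with n a power of two
    A = x.reshape(n, n)
    A = A.T.copy()                           # step 1: transpose
    A = np.fft.fft(A, axis=1)                # step 2: n FFTs, one per block of length n
    p, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    A = A * np.exp(-2j * np.pi * p * k / N)  # step 3: twiddle factors W_N^{pk}
    A = A.T.copy()                           # step 4: transpose
    A = np.fft.fft(A, axis=1)                # step 5: n FFTs, one per block of length n
    return A.T.reshape(-1)                   # step 6: transpose and flatten

x = np.random.rand(2 ** 10) + 1j * np.random.rand(2 ** 10)
np.testing.assert_allclose(six_step_fft_sketch(x), np.fft.fft(x), atol=1e-8)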

Note:
When T has length $2^{2k+1}$ ($= 2 \times 2^{2k}$), we will use the 8-step FFT. It does a preprocessing step followed by calling the 6-step FFT twice, once for the first half and once for the second half (each half has length $2^{2k}$).

Reminder:
When T has real-valued elements, RFFT first halves the length, then it calls 6-step FFT or 8-step FFT, depending on log2(len(T) / 2) being even or odd.


Regarding the transpose step:

In our implementation, we do:

@njit
def transpose_v1(x):
    n = int(np.sqrt(len(x)))
    for k in range(n):
        for p in range(k + 1, n):
            i = k + p * n
            j = p + k * n
            x[i], x[j] = x[j], x[i]

instead of

@njit
def transpose_v2(x):
    n = int(np.sqrt(len(x)))
    x[:] = x.reshape(n,n).T.copy().reshape(-1,)
    # note: `.copy()` is required. see: https://github.com/numba/numba/issues/5433
image

So, our for-loop implementation should be okay. I am stopping here regarding the transpose step. We can get back to it later if needed.

NimaSarajpoor (Collaborator, Author) commented Apr 11, 2024

Profiling 6-step FFT

Now that we have a (somewhat) clear definition of six-step FFT (see previous comment), we can wrap a njit-decorated function around each step, and start profiling the six-step FFT algo at "step" level.

@njit(fastmath=True)
def _transpose(x, n):
    """
    n is sqrt(len(x))
    """
    for k in range(n):
        for p in range(k + 1, n):
            i = k + p * n
            j = p + k * n
            x[i], x[j] = x[j], x[i]
    
    return


@njit(fastmath=True)  
def _fft0(n, s, eo, x, y, c): 
    """
    A recursive function that is used as part of fft algorithm
    
    n : int
    s : int
    eo: bool
    x : numpy.array 1D
    y : numpy.array 1D
    """
    if n == 2:
        if eo:
            z = y
        else:
            z = x
        
        for i in range(s):
            j = i + s
            a = x[i]
            b = x[j]
            z[i] = a + b
            z[j] = a - b
            
    elif n >= 4:
        m = n // 2
        sm = s * m
    
        twiddle_factor = 1.0
        for p in range(m):
            sp = s * p
            two_sp = 2 * sp
            
            for q in range(s):
                i = sp + q
                j = i + sm
                
                k = two_sp + q
                y[k] = x[i] + x[j]
                y[k + s] = (x[i] - x[j]) * twiddle_factor
        
            twiddle_factor = twiddle_factor * c

        _fft0(m, 2*s, not eo, y, x, c * c)
        
    else:
        pass
    
@njit(fastmath=True)
def _apply_fft0(x, y, n, c_thata):
    for p in range(n):
        start = p * n
        _fft0(n, 1, False, x[start:], y[start:], c_thata)

    return


@njit(fastmath=True)
def _transpose_and_twiddle_factor(x, n, theta_init):
    n_plus_1 = n + 1
    for p in range(n):
        theta0 = theta_init * p
        ppn = p * n_plus_1
        
        c = math.cos(theta0) - 1j * math.sin(theta0)
        w = math.cos(theta0 * p) - 1j * math.sin(theta0 * p)
        for alpha in range(0, n - p):
            i = ppn + alpha
    
            if alpha == 0:
                x[i] = x[i] * w
            else:
                j = ppn + alpha * n
                x[j], x[i] = x[i] * w, x[j] * w
                
            w = w * c

    return


def _sixstep_fft(x, y):
    """
    x and y are of same length, and the log2 of their length is even.
    """
    N = len(x)
    n = int(np.sqrt(N))
    
    theta = 2 * math.pi / n
    c_thata = math.cos(theta) - 1j * math.sin(theta)
    theta_init = 2 * math.pi / N

    t1 = time.time()
    _transpose(x, n)  # step 1
    t2 = time.time()
    _apply_fft0(x, y, n, c_thata)  # step 2
    t3 = time.time()
    _transpose_and_twiddle_factor(x, n, theta_init)  # step 3 & 4
    t4 = time.time()
    _apply_fft0(x, y, n, c_thata)  # step 5
    t5 = time.time()
    _transpose(x, n)   # step 6
    t6 = time.time()
    
    out = np.array([t1, t2, t3, t4, t5, t6])

    return out

As provided above, our _sixstep_fft function returns an array of length 6 from which one can calculate the running time of each step (note that steps 3 & 4 are merged together).

Note that the log2 of the length of the input in _sixstep_fft(x, y) must be even. Therefore, I profiled inputs with length 2^p, where p is in range(2, 24 + 1, 2). I used the following code:

pmax = 24
power_values = np.arange(2, pmax + 1, 2)

seed = 0
np.random.seed(seed)
T = np.random.rand(2 ** pmax).astype(np.complex_)

n_iter = 1000
running_time_collect = np.empty((len(power_values), 5), dtype=np.longdouble)

for i, p in enumerate(power_values):
    print(f'p --> {p}')
    n = 2 ** p
    x_main = T.copy()[:n]
    y_main = np.empty_like(x_main)

    x = x_main.copy()
    y = y_main.copy()
    _sixstep_fft(x, y)
    
    running_time = np.empty((n_iter, 5), dtype=np.longdouble)
    for j in range(n_iter):
        x = x_main.copy()
        y = y_main.copy()
        out = _sixstep_fft(x, y)
        running_time[j] = out[1:] - out[:-1]

    running_time_collect[i] = np.median(running_time, axis=0)

Each row of running_time_collect contains 5 values corresponding to the running time of step 1 - 2 - 3&4 - 5 - 6.

image

Note 1: I did not plot the running time for inputs with length 2^p, p < 10, as the measured running times for almost all steps were zero.
Note 2: Steps 1 and 6 are both "transpose", and steps 2 and 5 are both "apply_fft0". Therefore, to simplify the plot above, we can sum the running times of the steps that call the same function. Hence:

image

Note that the transpose part (steps 1 & 6 together) takes around 30% of the running time for p >= 16.


Comparing the performances of different functions for "Transpose" step

def func0(x, n):
  x[:] = np.transpose(x.reshape(n, n)).reshape(-1,)
  return


def func1(x, n):
  x[:] = x.reshape(n,n, order='F').ravel()
  return


def func2(x, n):
    x[:] = np.transpose(x.reshape(n, n)).ravel()
    return


@njit
def func3(x, n):
    x[:] = np.transpose(x.reshape(n, n)).copy().reshape(-1,)
    return


@njit
def func4(x, n):
    x[:] = np.transpose(x.reshape(n, n)).ravel()
    return


@njit(fastmath=True)
def func5(x, n):
  for k in range(n):
    for p in range(k + 1, n):
      i = k + p * n
      j = p + k * n

      x[i], x[j] = x[j], x[i]

  return


@njit(fastmath=True)
def func6(x, n):
  for k in range(n):
    kn = k * n
    for p in range(k + 1, n):
        i = k + p * n
        j = p + kn

        x[i], x[j] = x[j], x[i]

  return


@njit(fastmath=True)
def func7(x, n, x_transpose):
    """
    we already have a helper variable in 6-step fft algorithm, whose length is the same as x's. 
    We can use that helper variable and pass it as the third argument of this function
    to store the result. We can then use x as the helper variable in the rest of code.
    """
    idx = 0
    for _, val in np.ndenumerate(np.transpose(x.reshape(n, n))):
        x_transpose[idx] = val
        idx += 1
    return


@njit(fastmath=True, parallel=True)
def func8(x, n):
  for k in prange(n):
    kn = k * n
    for p in range(k + 1, n):
        i = k + p * n
        j = p + kn

        x[i], x[j] = x[j], x[i]

  return

Note: Numba does not support the second argument, order, for np.reshape

I used the following code to get the performance of each function for inputs with different lengths, 2^p, where p is in range(2, 24 + 1, 2).

seed = 0
np.random.seed(seed)

pmin = 2
pmax = 24
T = np.random.rand(2 ** pmax)

p_range_values = range(pmin, pmax + 1, 2)

n_iter = 1000

funcs = {
    'func0' : func0,
    'func1' : func1,
    'func2' : func2,
    'func3' : func3,
    'func4' : func4,
    'func5' : func5,
    'func6' : func6,
    'func7' : func7,
    'func8' : func8,
}

funcs_with_extra_input_x_transpose = ['func7']

performance = {}
for func_name, func_object in funcs.items():
    print(f'=================== {func_name} ===================')
    running_time = []

    for p in p_range_values:
        N = 2 ** p
        n = int(np.sqrt(N))

        x = T[:N].copy()
        
        # [note] we can use functools.partial to avoid the following if-else block
        if func_name in funcs_with_extra_input_x_transpose:
            x_transpose = np.empty_like(x)
            func_object(x, n, x_transpose)
        else:
            func_object(x, n)

        lst = []
        for _ in range(n_iter):
            x = T[:N].copy()
            
            if func_name in funcs_with_extra_input_x_transpose:
                x_transpose = np.empty_like(x)
                ts = time.time()
                func_object(x, n, x_transpose)
                te = time.time()
            else:
                ts = time.time()
                func_object(x, n)
                te = time.time()

            lst.append(te - ts)

        running_time.append(
            np.median(lst)
        )

    performance[func_name] = running_time

# save
with open('performance_transpose_funcs.pickle', 'wb') as handle:
    pickle.dump(performance, handle, protocol=pickle.HIGHEST_PROTOCOL)

We use the performance of func0 as our base (reference). The performance of the other functions w.r.t. func0 is shown below:

performance ratio (in MacOS)

A value below the horizontal line y=1 means better performance than the performance of func0 for the same input size.

image

Note: func8 (the multithreaded function) has poor performance for 2^p, with p <= 12. To better reflect the performance of other functions, I am going to show the performance of func8 for p > 12 only.

image

It is interesting to see that the function func5 (and func6) ....

@njit(fastmath=True)
def func5(x, n):
  for k in range(n):
    for p in range(k + 1, n):
        i = k + p * n
        j = p + k * n

        x[i], x[j] = x[j], x[i]

  return

performs poorly for len(T) between 2^16 and 2^20 (inclusive). It is interesting to note that we saw a similar bump in the FFT performance before. Can this be the reason? Also, why do we have a bump in the performance here?


Interestingly, func7 ....

@njit(fastmath=True)
def func7(x, n, x_transpose):
    """
    we already have a helper variable in 6-step fft algorithm, whose length is the same as x's. 
    We can use that helper variable and pass it as the third argument of this function
    to store the result. We can then use x as the helper variable in the rest of code.
    """
    idx = 0
    for _, val in np.ndenumerate(np.transpose(x.reshape(n, n))):
        x_transpose[idx] = val
        idx += 1
    return

shows a good performance overall. However, it is considerably outperformed by func8 (the multithreaded code) when input size is >= 2^20.

I also ran the code in Linux OS (using Google Colab)

performance of transpose functions (in Linux OS via Google Colab, with two threads)

image

seanlaw (Contributor) commented Apr 12, 2024

This cache-oblivious transpose algorithm (and matrix multiplication algorithm) might be useful for future reference as it relates to blocking/tiling

NimaSarajpoor (Collaborator, Author) commented Apr 17, 2024

This cache-oblivious transpose algorithm (and matrix multiplication algorithm) might be useful for future reference as it relates to blocking/tiling

(TLDR) Tiling seems to be effective for matrix transpose! (Check out func6 below)


Let's set our baseline first. Of the following three functions, we choose func_C (in fact, the performances of the three functions are very close to each other).

def func_A(x, n):
  x[:] = np.transpose(x.reshape(n, n)).reshape(-1,)
  return

def func_B(x, n):
  x[:] = x.reshape(n,n, order='F').ravel()
  return

def func_C(x, n):
    x[:] = np.transpose(x.reshape(n, n)).ravel()
    return

Alternative, more performant functions:

@njit
def func1(x, n):
    # the njit-decorated version of `func_C`
    x[:] = np.transpose(x.reshape(n, n)).ravel()
    return

@njit(fastmath=True)
def func2(x, n):
  # SingleThreaded function that is currently used in our six-step FFT implementation
  for k in range(n):
    for p in range(k + 1, n):
      i = k + p * n
      j = p + k * n
      x[i], x[j] = x[j], x[i]

  return

@njit(fastmath=True, parallel=True)
def func3(x, n):
  # MultiThreaded function that is currently used in our six-step FFT implementation
  for k in prange(n):
    for p in range(k + 1, n):
      i = k + p * n
      j = p + k * n
      x[i], x[j] = x[j], x[i]

  return


@njit(fastmath=True)
def func4(x, n, x_transpose):
    # a function that was recently proposed in this issue page...
    # for transposing matrix and store it in `x_transpose`
    idx = 0
    for _, val in np.ndenumerate(np.transpose(x.reshape(n, n))):
        x_transpose[idx] = val
        idx += 1
    return


@njit(fastmath=True)
def func5(x, n, x_transpose):
    # tiling, based on this stack overflow: 
    # https://stackoverflow.com/questions/5200338/a-cache-efficient-matrix-transpose-program
    blocksize = 32
    blocksize = min(blocksize, n)    
    for i in range(0, n, blocksize):
        for j in range(0, n, blocksize):
            for k in range(i, i + blocksize):
                for l in range(j, j + blocksize):
                    x_transpose[l + k * n] = x[k + l * n]
    return


@njit(fastmath=True)
def func6(x, n, x_transpose):
    # Similar to func5. I just use numpy operation instead of two for-loops.
    blocksize = 32
    blocksize = min(blocksize, n)

    x = x.reshape(n, n)
    x_transpose = x_transpose.reshape(n, n)
    for i in range(0, n, blocksize):
        for j in range(0, n, blocksize):
            x_transpose[i:i + blocksize, j:j + blocksize] = x[j:j + blocksize, i:i + blocksize].transpose()
            
    return

In addition to the functions above, we are going to consider one more function based on the algorithm proposed in Cache-Oblivious Algorithms.

@njit(fastmath=True)
def _func7(a, n, N):
    if n <= 32:
        for i in range(n):
            for j in range(i):
                a[i * N + j], a[j * N + i] = a[j * N + i], a[i * N + j]

      
    else:
        k = n // 2

        _func7(a, k, N)
        _func7(a[k:], k, N)
        _func7(a[k * N:], k, N)
        _func7(a[k * N + k:], k, N)

        for i in range(k):
            for j in range(k):
                a[i * N + (j + k)], a[(i + k) * N + j] = a[(i + k) * N + j], a[i * N + (j + k)]

        if n % 2:
            for i in range(n - 1):
                a[i * N + n - 1], a[(n - 1) * N + i] = a[(n - 1) * N + i], a[i * N + n - 1]

@njit
def func7(a, n):
     # based on the algo provided in the following link: 
     # https://en.algorithmica.org/hpc/external-memory/oblivious/
    _func7(a, n, n)

The code to check the performance is provided in this Colab notebook.

The following result is obtained by running the code on my MacOS machine (not the Google Colab). I got 500 samples of running time per case, and used its median. Each data point in the plot below is the running time ratio w.r.t the running time of baseline function.

image

Running the code in Colab gives (roughly) similar result (In Colab, I got 100 samples per case).

image

I think the winner is func6. Numba, tiling, and NumPy operations seem to work well together.

Previously, we noticed that a for-loop can be faster than a NumPy operation when we use Numba. However, it now seems that the NumPy operation is faster if we use Numba AND take the cache into account.


@seanlaw
func6 is basically func5 but with a NumPy operation in each tile. The tiling approach is based on the answer provided in this Stack Overflow post. I noticed that this approach is aligned with the approach provided in this paper (see section 3). When the matrix size is an exact power of two, the divide-and-conquer approach in the paper is equivalent to two for-loops that traverse the matrix tile-by-tile, where each tile is a blocksize-by-blocksize matrix.

seanlaw (Contributor) commented Apr 17, 2024

@NimaSarajpoor That's awesome! Great work. It's nice to see that tiling works so well to help reduce cache misses! Also, there's no prange/multithreading either which is good. It's really great and rewarding to be able to learn and leverage published academic research like this. I really wonder if we can rethink tiling for the _stump function based on this new knowledge.

Should we really turn on fastmath? If so, we will need to remember to correct the final output if the original inputs are not finite right? So a _fft function assumes finite inputs (and uses fastmath) but any function that calls _fft will need to handle/correct non-finite values first.

NimaSarajpoor (Collaborator, Author) commented Apr 17, 2024

@seanlaw

It's nice to see that tiling works so well to help reduce cache misses

Yes! This is the first time I see the benefit of considering cache with my own eyes 😅 Thank you for the guidance! That was a very good lesson for me! I still need to study more about it, but I think that was a very good experience!!!

I really wonder if we can rethink tiling for the _stump function based on this new knowledge.

Right! We can revisit it.

Should we really turn on fastmath? If so, we will need to remember to correct the final output if the original inputs are not finite right?

Good question! I am going to check a few things:

(1) what will be the performance after removing fastmath? (or just considering certain flags?)
(2) what should be the output of FFT when input contains inf? Going to check the output of scipy fft & numpy fft.

Btw, STUMPY cares about the sliding dot product. So, a private _fft that assumes the input data is already preprocessed and has no inf/nan should be enough. What do you think? Should we still consider adding a public fft function that handles inf / nan data?

NimaSarajpoor (Collaborator, Author):

Good question! I am going to check a few things:

(1) what will be the performance after removing fastmath? (or just considering certain flags?) (2) what should be the output of FFT when input contains inf? Going to check the output of scipy fft & numpy fft.

Btw, STUMPY cares about sliding-dot-product. So, the private _fft that assumes input data is already preprocessed and has no inf/nan should be enough. What do you think? Should we still consider adding fft function that handles inf / nan data ?

(1) Removing fastmath does not affect the performance of the transpose function func6 provided in this comment. This is good!

(2) Let F denote the RFFT of a 1-D array. F.real becomes all inf (nan) if there is at least one inf (nan) in the input array, and the inverse RFFT (i.e. IRFFT) of F becomes all nan / inf. We can try to create an fft function that handles this case (i.e. when one or more elements are nan or inf), but this is not going to be useful from STUMPY's standpoint. What do you think, @seanlaw?
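
For the NaN case, a quick way to see this behaviour (a tiny NumPy illustration, not STUMPY code):

import numpy as np

T = np.random.rand(16)
T[3] = np.nan                # a single NaN ...
F = np.fft.rfft(T)
print(np.isnan(F).all())     # ... contaminates every output bin: prints True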

seanlaw (Contributor) commented Apr 20, 2024

We can try to create a fft function that handles this case (i.e. when one/more element is nan or inf) but this is not going to be useful from STUMPY's standpoint. What do you think, @seanlaw ?

I understand. If we only do _fft then we need to make sure that this private function clearly states the assumptions and then reminds the user that:

F.real becomes all inf (nan) if there is at least one inf (nan) in the input array. And, the Inverse RFFT (i.e. IRFFT) of F becomes all nan/ inf.

In other words, the _fft docstring tells you exactly what you need if you want a public function that can handle any inputs. Of course, it probably takes a lot less effort just to write the fft public function than to add additional documentation to _fft

seanlaw (Contributor) commented Apr 22, 2024

I also forgot to ask if you've played around with blocksize = 32, or if it is possible that this choice is hardware dependent? If so, could/should we make this a STUMPY config variable, which would allow the user to run an optimization to find the best blocksize for their hardware?
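
Something along these lines, perhaps (a rough sketch; STUMPY_FFT_BLOCKSIZE is a hypothetical name, not an existing STUMPY config variable), where the tiling is the same as func6 above but the block size is passed in:

import numpy as np
from numba import njit

STUMPY_FFT_BLOCKSIZE = 32  # could live in stumpy.config and be tuned per machine

@njit(fastmath=True)
def _tiled_transpose(x, n, x_transpose, blocksize):
    # same tiling as func6, but with the block size as an argument
    blocksize = min(blocksize, n)
    a = x.reshape(n, n)
    out = x_transpose.reshape(n, n)
    for i in range(0, n, blocksize):
        for j in range(0, n, blocksize):
            out[i:i + blocksize, j:j + blocksize] = a[j:j + blocksize, i:i + blocksize].transpose()
    return

A user could then override the block size (e.g. after a small timing sweep on their own hardware) before calling the FFT-based sliding dot product.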

NimaSarajpoor (Collaborator, Author) commented Apr 22, 2024

Of course, it probably takes a lot less effort just to write the fft public function than to add additional documentation to _fft

Makes sense! Got the point. I will address it after wrapping up the private function.


I also forgot to ask if you've played around with blocksize = 32 or if it is possible that this choice is hardware dependent? If so, could/should we make this a STUMPY config variable which we could possibly allow the user to run an optimization to find the best blocksize for their hardware?

I recently thought about this, as I know you usually prefer a config variable, if possible / needed, rather than a hard-coded constant. Regarding the transpose, I tried blocksize values in {16, 32, 64, 128} on my macOS machine and on Google Colab and noticed no significant improvement. I will check again and share the notebook.


[WIP] Notebooks for profiling different components of our FFT implementation

Also, I have been profiling EACH block of code, to figure out what works best per block.

For instance, regarding x[:] = T[::2] + 1j * T[1::2] (let's call it "compress T"), I noticed that we can simply use a for-loop with fastmath=True and avoid prange for T up to length 2^24 (see this notebook). I also tried it on my macOS machine with 8 threads and reached a similar conclusion. (We might reach a different conclusion for a different number of threads; something I need to check, and see whether it needs to be considered in our implementation.) A sketch of this loop is shown right below.
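
A minimal sketch of that for-loop version of the compress step (assuming len(T) is even and x has length len(T) // 2):

import numpy as np
from numba import njit

@njit(fastmath=True)
def _compress_T(T, x):
    # pack the real-valued T into a complex array of half the length:
    # x[k] = T[2k] + 1j * T[2k + 1]
    for k in range(len(x)):
        x[k] = T[2 * k] + 1j * T[2 * k + 1]
    return

T = np.random.rand(2 ** 20)
x = np.empty(len(T) // 2, dtype=np.complex128)
_compress_T(T, x)
np.testing.assert_almost_equal(x, T[::2] + 1j * T[1::2])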

Compress T
Transpose
Apply fft on n blocks, each with size n
Tranpose_with_Twiddle_Factor
8step-function-preprocessing-part
8step-function-postprocessing-part

(To be continued)


Also, another thing I have been trying out is using pre-computed factors instead of computing each factor inside the FFT algorithm. (This is based on something I read somewhere; it is one of the known ways to further improve an FFT implementation.) The values of these factors ONLY depend on the length of the input.

For instance, let's take a look at the following block of code that is used in the last stage of rfft algorithm.

# n = len(T)
m = n // 2

theta = math.pi / m
for k in range(1, m // 2):
    factor = math.cos(k * theta) - 1j * math.sin(k * theta)

We can see a similar part in the 8-step FFT algorithm, which is called when T has length $2^{even}$.

theta = math.pi / (m // 2)   # m is len(T) // 2
for i in range(m // 2):
   factor = math.cos(i * theta) - 1j * math.sin(i * theta)

We can save the factors in an array of length m // 2. The factor in the second block is just the square of the corresponding factor in the first block (this particular part may not necessarily be faster than the math.cos + 1j * math.sin operation!). I am not sure whether it makes sense to pre-compute the factors, save them in a 1-D array, and re-use them when fft is called. I am trying to see if I can use the same trick in the recursive function that is used in the 6-step FFT algorithm.

Note: It is certainly cleaner if we can reach reasonable performance without this!
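
For reference, a hedged sketch of what that pre-computation could look like (the squared array is the one the second block would reuse):

import math
import numpy as np

def _precompute_factors(m):
    # factors[k] = cos(k * theta) - 1j * sin(k * theta) with theta = pi / m;
    # the factors needed with theta = pi / (m // 2) are simply these squared
    k = np.arange(m // 2)
    theta = math.pi / m
    factors = np.cos(k * theta) - 1j * np.sin(k * theta)
    return factors, factors * factors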

seanlaw (Contributor) commented Apr 23, 2024

Also, another thing I have been trying out is to use pre-computed factors instead of computing each factor in fft algorithm. (This is based on what I read before somewhere(?). It is one of the ways one can use to further improve FFT algorithm) The values of these factors ONLY depend on the length of input.

You might be able to use functools.cache (rather than lru_cache) to memoize the factors:

@cache
def fft_cosine_factor(i, theta):
    return math.cos(i * theta)

@cache
def fft_sine_factor(i, theta):
    return math.sin(i * theta)

So, when you call either of these functions, Python will check whether the given i and theta combination has been used before. If not, it will perform the computation and keep the result in memory. If the inputs have been seen before then it will simply pull the result from memory and avoid re-computing the result. Note that this will NEVER work if the input is a numpy array as it is not possible to ensure that the elements of the array have not been altered. However, in this case, since i and theta are scalar values, then we're good!

theta = math.pi / m
for k in range(1, m // 2):
    # factor = math.cos(k * theta) - 1j * math.sin(k * theta)  # The next line is equivalent to this
    factor = fft_cosine_factor(k, theta) - 1j * fft_sine_factor(k, theta)

seanlaw (Contributor) commented Apr 23, 2024

@NimaSarajpoor I discovered something really interesting while looking into cache-oblivious FFT algorithms. The inventor of cache-oblivious algorithms, Matteo Frigo, had provided a cache-oblivious version of FFT in his original paper (see Section 3) and it turns out that Frigo is also the creator of FFTW! Which, you guessed it, uses a cache-oblivious algorithm!

I am starting to think that (in a separate PR) it might be helpful to create a notebook to help explain cache-oblivious algorithms (with diagrams) starting with the basics of how cache sits in between the CPU and main memory (RAM), to understanding blocks, a simple for-loop of how to divide up a matrix and traverse blocks, and how it can be used in a matrix transpose, and finally to FFT. Ultimately, I'm hoping that this can be generalized to help us think about using blocks/tiles for the stump function as cache-oblivious algorithms (let's call it COA from now on as I'm getting tired of writing it out 😆).

@NimaSarajpoor
Collaborator Author

So, when you call either of these functions, Python will check whether the given i and theta combination has been used before. If not, it will perform the computation and keep the result in memory. If the inputs have been seen before then it will simply pull the result from memory and avoid re-computing the result.

Precomputing the array has not shown promising results. I am going with the cache approach. Thanks for the very informative explanation!!!

Frigo is also the creator of FFTW! Which, you guessed it, uses a cache-oblivious algorithm!

Interesting! I skimmed the paper before, and our current cache-oblivious matrix transpose basically follows the approach proposed in it. I didn't notice, though, that he is one of the authors of FFTW! It might be too soon to say this (?) but it seems we are on the right track!

I am starting to think that (in a separate PR) it might be helpful to create a notebook to help explain cache-oblivious algorithms (with diagrams) starting with the basics of how cache sits in between the CPU and main memory (RAM), to understanding blocks,

Agreed! I think that will be fun!

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Apr 26, 2024

[Update]

I am going with the cache approach.

Apparently the decorators @cache and @njit can work together for a function unless that function itself is being called by another njit-decorated function! (see this GitHub issue)

To see the impact of cache on performance:

from numba import njit
from functools import cache
import math

@cache
@njit
def math_cosine_cache(i, theta):
    N = 10000000
    total = 0
    for _ in range(i, i + N):
        total = total + math.cos(i * theta)
    return total


@njit
def math_cosine(i, theta):
    # identical computation, but without result caching
    N = 10000000
    total = 0
    for _ in range(i, i + N):
        total = total + math.cos(i * theta)
    return total

And the problem appears here:

@njit
def wrapper(i, theta):
    # fails: njit-compiled code cannot call the cache-wrapped function
    return math_cosine_cache(i, theta)

wrapper(0, math.pi)

In fact, @cache can decorate an njit-decorated function. However, @njit cannot decorate a cache-decorated function, and njit-compiled code (like wrapper above) cannot call one either.
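One possible workaround (just a sketch with illustrative names, not something we have settled on) is to evaluate the cached helper in pure Python and pass its scalar result into the njit-compiled function, so numba never has to handle the cache-wrapped object:

from functools import cache
import math
from numba import njit

@cache
def cosine_factor(i, theta):
    # cached on the Python side; numba never calls this directly
    return math.cos(i * theta)

@njit
def use_factor(c):
    # njit code only receives a plain float
    return 2.0 * c

c = cosine_factor(3, math.pi / 8)
out = use_factor(c)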

@seanlaw
Contributor

seanlaw commented Apr 30, 2024

Precomputing the array has not shown promising result.

Hmm, this seems surprising. Can you share a small, self-contained, reproducible example with only the sine/cosine parts?

@NimaSarajpoor
Collaborator Author

Surprising indeed! Will do! Also need to share one to show the impact of different block sizes on performance.

I am trying to see if I can replace the recursive fft function with a for-loop. Can that affect the performance? I read that it might be faster, since a recursive function pushes each call onto the stack. I will continue working on it for the next few days and provide an update.

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented May 6, 2024

Can you share a small, self contained, reproducible example with only the sine/cosine parts?

I created three versions and put them all in this notebook. The first version is the original version. The second version takes advantage of the identity $e^{j\theta} = \cos(\theta) + j\sin(\theta)$. In the third version, we pass an array filled with precomputed values as an argument. No performance gain is observed in the third version. I think we are dealing with memory access time here.

[Update]
Running the code on my macOS machine gives me a different result. It shows an improvement when using the precomputed array.
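For reference, the three factor-computation variants could be sketched roughly like this (illustrative values and names, not the actual notebook code):

import math
import cmath
import numpy as np

theta = math.pi / 16
m = 8

# version 1: cos/sin computed inside the loop
f1 = [math.cos(k * theta) - 1j * math.sin(k * theta) for k in range(m)]

# version 2: the complex-exponential identity, exp(-1j * k * theta)
f2 = [cmath.exp(-1j * k * theta) for k in range(m)]

# version 3: a precomputed array passed in once and indexed inside the loop
f3 = np.exp(-1j * theta * np.arange(m))

assert np.allclose(f1, f2) and np.allclose(f2, f3)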

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented May 6, 2024

Also need to share one to show the impact of different block sizes on performance.

I checked the impact of the block size in the cache-oblivious algorithm for matrix transpose. You can see this notebook. I do not see any (significant) improvement from changing the block size.
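For context, here is a minimal sketch of the kind of cache-oblivious transpose being benchmarked, with the base-case block size exposed as a parameter (the function and parameter names are illustrative, not the notebook's code):

import numpy as np

def co_transpose(A, B, r0, r1, c0, c1, base=32):
    # Recursively split the larger dimension until the sub-block is at most
    # `base` x `base`, then transpose that block directly. `base` is the
    # block-size knob whose effect on performance is being measured.
    if (r1 - r0) <= base and (c1 - c0) <= base:
        B[c0:c1, r0:r1] = A[r0:r1, c0:c1].T
    elif (r1 - r0) >= (c1 - c0):
        mid = (r0 + r1) // 2
        co_transpose(A, B, r0, mid, c0, c1, base)
        co_transpose(A, B, mid, r1, c0, c1, base)
    else:
        mid = (c0 + c1) // 2
        co_transpose(A, B, r0, r1, c0, mid, base)
        co_transpose(A, B, r0, r1, mid, c1, base)

A = np.random.rand(256, 256)
B = np.empty_like(A)
co_transpose(A, B, 0, 256, 0, 256, base=32)
assert np.array_equal(B, A.T)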

@NimaSarajpoor
Collaborator Author

I am trying to see if I can replace the fft recursive function with for-loop. Can that affect the performance?

I wrote the for-loop version of the recursive function we use in the 6-step FFT (see this notebook). Ran it on macOS and noticed no performance gain.
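For illustration, replacing the recursion with a loop could look roughly like this (a sketch of the general idea, not the notebook's exact code): each pass writes from one buffer into the other, and the buffers are then swapped instead of making a recursive call.

import math
import numpy as np

def stockham_fft_iterative(x):
    # assumes len(x) is a power of two; returns the (unnormalized) DFT of x
    n = len(x)
    x = np.asarray(x, dtype=np.complex128).copy()
    y = np.empty(n, dtype=np.complex128)
    s = 1        # stride
    nn = n       # current butterfly span
    while nn > 1:
        m = nn // 2
        theta0 = 2 * math.pi / nn
        for p in range(m):
            wp = complex(math.cos(p * theta0), -math.sin(p * theta0))
            for q in range(s):
                a = x[q + s * (p + 0)]
                b = x[q + s * (p + m)]
                y[q + s * (2 * p + 0)] = a + b
                y[q + s * (2 * p + 1)] = (a - b) * wp
        x, y = y, x   # ping-pong the buffers instead of a recursive call
        nn //= 2
        s *= 2
    return x

z = stockham_fft_iterative(np.arange(8.0))
assert np.allclose(z, np.fft.fft(np.arange(8.0)))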

@seanlaw
Contributor

seanlaw commented May 7, 2024

I created three versions and put them all in this notebook. The first version is the original version. The second version takes advantage of the identity $e^{j\theta} = \cos(\theta) + j\sin(\theta)$. In the third version, we pass an array filled with precomputed values as an argument. No performance gain is observed in the third version. I think we are dealing with memory access time here.

Here are a few observations for you to consider:

  1. It looks like fill_c_array isn't even called by func_version3 so is there even a need to use njit for fill_c_array rather than using straight numpy? Otherwise, why not consider prange for filling the contents of the array?
  2. You've split fill_c_array into two functions (a public function and a private function) and will incur the cost of invoking the private function. This may slow things down ever so slightly (test this with and without njit) but probably not by enough to matter. Of course, now you are compiling two functions instead of one.
  3. So, when I was talking about result caching (i.e., "memoization" with lru_cache), I was really referring to the point that values were constantly being recomputed. However, it looks like the c_array is only being computed once, the values are used only once, and that same c_array is not being reused for other lengths of n. This is inefficient. Zooming out a bit and thinking about the bigger picture (DAMP or MADRID), we should remember that we'll be calling FFT many, many times but for different lengths of n (that are strictly a power of 2), which means that all of our possible c_array would/should have n in this same range too. It makes no sense to precompute the array once and then throw it away and never use it again. Otherwise, we should test the performance of re-using the c_array. Maybe I misunderstood something?
  4. Is the for loop in _fill_c_array the same as np.power(c_thata, np.arange(m))?
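If it helps, the equivalence in point 4 can be checked quickly (illustrative values; this assumes _fill_c_array accumulates successive powers of c_theta):

import numpy as np

m = 8
c_theta = np.exp(-1j * np.pi / m)   # illustrative base factor

# a loop that accumulates successive powers of c_theta
loop_version = np.empty(m, dtype=np.complex128)
value = 1.0 + 0.0j
for k in range(m):
    loop_version[k] = value
    value *= c_theta

assert np.allclose(loop_version, np.power(c_theta, np.arange(m)))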

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented May 8, 2024

  1. It looks like fill_c_array isn't even called by func_version3 so is there even a need to use njit for fill_c_array rather than using straight numpy? Otherwise, why not consider prange for filling the contents of the array?

You are right (re: isn't even called by func_version3)!

I initially wrote this function to be used in func_version3. Later, I decided not to include it in the running time, as the content of the array just depends on its length; it can be computed once and be used across several calls of func_version3 (more on this below).

So is there even a need to use njit for fill_c_array rather than using straight numpy? Otherwise, why not consider prange for filling the contents of the array?

I think it is worth it to use njit to speed it up if our goal is to compute it when DAMP is initialized.

You've split fill_c_array into two functions (a public function and a private function) and will incur the cost of invoking the private function.

The private function is actually a recursive function. I can convert it to a for-loop, and then keep everything in one function. I will work on it.

However, it looks like the c_array is only being computed once and the values are use only once and that same c_array is not being reused for other lengths of n.

Two notes:
(1) I simplified the code in the provided notebook. The recursive function func_version3(n, s, x, c_arr) is actually called 2 * n times by the 6-step FFT algorithm (where n is $\sqrt{len(x)}$). For instance, for an array x with size 2^22, the function is going to be called 2 * (2^11) times. So, we are going to use the precomputed array 2^12 times in the 6-step function.

(2) In DAMP, for example, with each new batch of data points, we need to use FFT. Suppose that the window size m is set to 2^8. When a new batch arrives, we are going to apply FFT on a slice with length 2^9, then a slice with length 2^10, and so on until we exhaust all historical data points or abandon the search early. Then, we do the same process for the next batch of data. So, in DAMP, for 1M new data points, we may re-use the precomputed array (with length 2^10) 1M times (and, recall that the function func_version3(2^5, s, x, c_arr) is going to be called 2 * (2^5) times).
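One way to organize this (just a sketch with illustrative names; the factor layout used by the real 6-step code may differ) is to keep one precomputed factor array per power-of-two length and reuse it across all FFT calls of that length:

import numpy as np

_factor_cache = {}

def get_factors(n):
    # one factor array per power-of-two length, computed once and reused
    if n not in _factor_cache:
        theta = 2.0 * np.pi / n
        _factor_cache[n] = np.exp(-1j * theta * np.arange(n // 2))
    return _factor_cache[n]

# e.g., DAMP-style usage: repeatedly requesting the same lengths hits the cache
for length in (2**9, 2**10, 2**9, 2**10):
    factors = get_factors(length)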

Is the for loop in _fill_c_array the same as np.power(c_thata, np.arange(m))?

Right! Will test its performance as it is definitely cleaner!

@NimaSarajpoor
Collaborator Author

I initially wrote this function to be used in func_version3. Later, I decided to not consider it in the running time (as the content of array just depends on its length. It can be computed once and be used in several calls of func_version3 (More on this below)

(1) I simplified the code in the provided notebook. The recursive function func_version3(n, s, x, c_arr) is actually called 2 * n times by the 6-step FFT algorithm (where n is $\sqrt{len(x)}$). For instance, for an array x with size 2^22, the function is going to be called 2 * (2^11) times. So, we are going to use the precomputed array 2^12 times in the 6-step function.

To properly show the impact of using a precomputed array, I can write a function that calls func_version3(n, ...) n times (like what we have in the 6-step FFT). This should help us get a better / more accurate idea about the impact of this approach on the running time.

@NimaSarajpoor
Collaborator Author

[WIP]

I initially wrote this function to be used in func_version3. Later, I decided to not consider it in the running time (as the content of array just depends on its length. It can be computed once and be used in several calls of func_version3 (More on this below)

(1) I simplified the code in the provided notebook. The recursive function func_version3(n, s, x, c_arr) is actually called 2 * n times by the 6-step FFT algorithm (where n is $\sqrt{len(x)}$). For instance, for an array x with size 2^22, the function is going to be called 2 * (2^11) times. So, we are going to use the precomputed array 2^12 times in the 6-step function.

To properly show the impact of using a precomputed array, I can write a function that calls func_version3(n, ...) n times (like what we have in the 6-step FFT). This should help us get a better / more accurate idea about the impact of this approach on the running time.

The Colab notebook is now modified to address the point above. I also ran the code on my macOS machine and got the following result (code):

[image: timing results]

In BOTH functions func_version3 and func_version4, we are using a precomputed array that contains the factors. In func_version4, however, we take into account the time required for computing that array.


Now let's consider the full steps 2 & 5 (of our 6-step algorithm) rather than just their cos/sin part (code). I ran it on my macOS machine and got the following result:

[image: timing results]

Note that fft_v1, fft_v2, and fft_v3 are exactly the same except for the "factor" part. In fft_v2, we use a precomputed array (the approach we took in func_version4 above). In fft_v3, we pass a single factor and, based on that, compute the other factors (the approach we took in func_version5 above).

@seanlaw
Contributor

seanlaw commented May 9, 2024

Btw is c_thata a typo and should be c_theta?

@NimaSarajpoor
Collaborator Author

Right!😅 Thanks for catching it! Will fix it.
