
Optimize aamp for p==2 (and maybe p==1) #729

Open
NimaSarajpoor opened this issue Nov 20, 2022 · 13 comments
Labels
enhancement New feature or request

Comments

@NimaSarajpoor
Collaborator

NimaSarajpoor commented Nov 20, 2022

Currently, aamp accepts the argument p (default 2.0) to calculate the p-norm distance. According to code written by the scipy community, it is better to use np.square(...) when p is 2.

Let's compare

| aamp | n=10_000, m=50 | n=20_000, m=50 | n=50_000, m=50 |
| --- | --- | --- | --- |
| MAIN_BRANCH | 0.8 | 3 | 20 |
| MAIN_BRANCH-OPTIMIZED | 0.7 | 0.3 | 2 |

The OPTIMIZED version adds the following two functions to stumpy/core.py and uses them in stumpy/aamp.py:

# in core.py

import numpy as np
from numba import njit


@njit(fastmath=True)
def _calculate_P_norm(a, p=2.0):
    # `a` is an array; return the p-th power of its p-norm
    if p == 2.0:
        return np.sum(np.square(a))
    else:
        return np.linalg.norm(a, ord=p) ** p


@njit(fastmath=True)
def _rolling_update_P_norm(a, old, new, p=2.0):
    # `a` is the previously computed p-norm; drop `old` and add `new`
    if p == 2.0:
        return a - np.square(old) + np.square(new)
    else:
        return a - np.absolute(old) ** p + np.absolute(new) ** p
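
As a quick sanity check (a sketch only; the import, array sizes, and checks are just illustrative assumptions), the two helpers should agree with the existing norm-based computation:

# hypothetical import, assuming the helpers above are added to stumpy/core.py
from stumpy.core import _calculate_P_norm, _rolling_update_P_norm
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(50)

# func1 should match the p-th power of the p-norm computed directly
assert np.isclose(_calculate_P_norm(a, p=2.0), np.linalg.norm(a, ord=2) ** 2)
assert np.isclose(_calculate_P_norm(a, p=3.0), np.linalg.norm(a, ord=3) ** 3)

# func2 should match recomputing the p-norm after dropping `old` and appending `new`
old, new = a[0], 0.123
rolled = _rolling_update_P_norm(_calculate_P_norm(a, p=2.0), old, new, p=2.0)
direct = _calculate_P_norm(np.append(a[1:], new), p=2.0)
assert np.isclose(rolled, direct)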

We may want to investigate the p==1 case as well and do the same, i.e., np.sum(np.abs(...)).


p==1 and p==2 are the most common cases and it would be good to reduce computing time for these two cases.

@NimaSarajpoor
Collaborator Author

@seanlaw
Do you think we should take care of this first? (It could help us reduce the computing time in top-k.)

@seanlaw
Contributor

seanlaw commented Nov 20, 2022

Sure! Might as well take care of p=1 as well if there is something there. Will all non-normalized functions use this?

@NimaSarajpoor
Collaborator Author

I think it will be used in aamp, gpu_aamp, scraamp, aampi.

We also have the argument p in core._mass_absolute_distance_matrix. However, this function uses scipy.spatial.distance.cdist to calculate the p-norm. So, for this latter case, I think we can leave it as is since scipy has probably already taken care of it in cdist (I have not checked the source of cdist yet).

@seanlaw
Contributor

seanlaw commented Nov 20, 2022

> I think it will be used in aamp, gpu_aamp, scraamp, aampi.

Where in gpu_aamp would it be used? Which function?

What about maamp? Let's make sure to double check all of the possible places where aamp exists

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Nov 21, 2022

UPDATE

Please allow me to first provide a little info regarding the impact of the two proposed functions (see the first comment) on the computing time:

In what follows:

  • func1 denotes _calculate_P_norm (proposed in the first comment in this issue)
  • func2 denotes _rolling_update_P_norm (proposed in the first comment in this issue)

Note that func1 will be used when we want to calculate the p_norm directly from all the elements of the subsequences, and func2 will be used when we want to calculate the p_norm via a rolling approach (i.e., using the previously calculated p_norm to compute the p_norm for a new pair of subsequences).

In stumpy/aamp.py, we can see the following lines of code in _compute_diagonal:

if uint64_i == 0 or uint64_j == 0:
    p_norm = (
        np.linalg.norm(
            T_B[uint64_j : uint64_j + uint64_m]
            - T_A[uint64_i : uint64_i + uint64_m],
            ord=p,
        )
        ** p
    )
else:
    p_norm = np.abs(
        p_norm
        - np.absolute(T_B[uint64_j - uint64_1] - T_A[uint64_i - uint64_1]) ** p
        + np.absolute(
            T_B[uint64_j + uint64_m - uint64_1]
            - T_A[uint64_i + uint64_m - uint64_1]
        )
        ** p
    )

Now, by using func1 and func2, we will have:

if uint64_i == 0 or uint64_j == 0:
    p_norm = func1(...)  # calculate p_norm directly using all elements of the subsequences
else:
    p_norm = func2(...)  # rolling approach: use the previously calculated p_norm to compute the new p_norm
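
For illustration, the calls might look roughly like the following (a sketch only, assuming the proposed helpers also accept scalar inputs for the rolling update; note that the existing code additionally wraps the update in np.abs to guard against tiny negative values, which would still need to be handled either inside func2 or by the caller):

if uint64_i == 0 or uint64_j == 0:
    # direct computation over the full subsequence difference
    p_norm = core._calculate_P_norm(
        T_B[uint64_j : uint64_j + uint64_m] - T_A[uint64_i : uint64_i + uint64_m],
        p,
    )
else:
    # rolling update based on the previously computed p_norm
    p_norm = core._rolling_update_P_norm(
        p_norm,
        T_B[uint64_j - uint64_1] - T_A[uint64_i - uint64_1],
        T_B[uint64_j + uint64_m - uint64_1] - T_A[uint64_i + uint64_m - uint64_1],
        p,
    )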

Tables of Performances
Note: func1 and func2 in the tables below are slightly different from the ones proposed in the first comment, as they now take care of p==1 as well.

| aamp, m=50, p=1 | n=1000 | n=5000 | n=10000 | n=25000 | n=50000 |
| --- | --- | --- | --- | --- | --- |
| main | 0.003 | 0.046 | 0.186 | 1.16 | 5.58 |
| using func1 | 0.004 | 0.05 | 0.186 | 1.142 | 5.37 |
| using func1 and func2 | 0.002 | 0.023 | 0.09 | 0.56 | 2.19 |

| aamp, m=50, p=2 | n=1000 | n=5000 | n=10000 | n=25000 | n=50000 |
| --- | --- | --- | --- | --- | --- |
| main | 0.009 | 0.2 | 0.81 | 5.38 | 25.88 |
| using func1 | 0.132 | 0.21 | 0.81 | 5.11 | 25.13 |
| using func1 and func2 | 0.003 | 0.03 | 0.087 | 0.54 | 2.2 |

Note that func2 has more impact than func1. That is because we only need to calculate the p_norm directly in a few cases; for most pairs of subsequences, we just use the rolling approach, i.e. func2, to calculate the p_norm.

The improvement in computing time for p=1 and n=50_000 is 60%.
The improvement in computing time for p=2 and n=50_000 is 90%.


gpu_aamp

In gpu_aamp.py, we can see the following lines:

PART-I:

# in function gpu_aamp

for idx, start in enumerate(range(0, l, step)):
    stop = min(l, start + step)
    
    p_norm = np.power(core.mass_absolute(T_B[start : start + m], T_A, p=p), p)
    p_norm_first = np.power(core.mass_absolute(T_A[:m], T_B, p=p), p)

Here, we can investigate whether we can improve the computing time. Note that we use core.mass_absolute, which calls scipy's cdist and might already be fast (or good enough).
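
One quick way to check whether cdist already specializes p == 2 would be to compare the generic Minkowski metric against the dedicated Euclidean one (a sketch; the array shapes and metric choices here are just assumptions for the comparison, not STUMPY code):

import timeit

import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(0)
A = rng.random((2_000, 50))
B = rng.random((2_000, 50))

# generic Minkowski metric with p=2 vs. the dedicated Euclidean metric
t_minkowski = timeit.timeit(lambda: distance.cdist(A, B, "minkowski", p=2), number=5)
t_euclidean = timeit.timeit(lambda: distance.cdist(A, B, "euclidean"), number=5)
print(t_minkowski, t_euclidean)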

PART-II:

# in function _compute_and_update_PI_kernel

if compute_p_norm:
    p_norm_out[j] = (
        p_norm_in[j - 1]
        - abs(T_B[i - 1] - T_A[j - 1]) ** p
        + abs(T_B[i + m - 1] - T_A[j + m - 1]) ** p
    )

I think this is similar to the updating process in _aamp, and thus we can use func2 to speed up the process when p==1 or p==2.


maamp

I can see the following cases in different parts of the code:

  • core.mass_absolute(..., p=p)

  • D = np.linalg.norm(disc_subseqs - disc_neighbors, axis=1, ord=p)

  • np.power(core.mass_absolute(...))

I feel the three items above should be fine, but we should still investigate.

  • we can also see the following block:
if idx % 2 == 0:
    # Even
    p_norm_even[i, j] = (
        p_norm_odd[i, j - 1]
        - abs(T[i, idx - 1] - T[i, j - 1]) ** p
        + abs(T[i, idx + m - 1] - T[i, j + m - 1]) ** p
    )
else:
    # Odd
    p_norm_odd[i, j] = (
        p_norm_even[i, j - 1]
        - abs(T[i, idx - 1] - T[i, j - 1]) ** p
        + abs(T[i, idx + m - 1] - T[i, j + m - 1]) ** p
    )

we may try to optimize it (this seems to be similar to the updating process; see the sketch below).
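
If we go this route, a rough sketch of how the proposed func2 could replace the even/odd update (assuming it accepts scalar inputs) might be:

if idx % 2 == 0:
    # Even
    p_norm_even[i, j] = core._rolling_update_P_norm(
        p_norm_odd[i, j - 1],
        T[i, idx - 1] - T[i, j - 1],
        T[i, idx + m - 1] - T[i, j + m - 1],
        p,
    )
else:
    # Odd
    p_norm_odd[i, j] = core._rolling_update_P_norm(
        p_norm_even[i, j - 1],
        T[i, idx - 1] - T[i, j - 1],
        T[i, idx + m - 1] - T[i, j + m - 1],
        p,
    )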


For now, I have only investigated aamp, so I am not 100% sure about what I said regarding gpu_aamp and maamp. Those are just my observations :)

@NimaSarajpoor
Collaborator Author

@seanlaw
Please let me know what you think. I have already created a branch in my local repo, and created and tested the new functions. Do you want to find/discuss all of the possible places for this optimization opportunity here on this issue page, or do you prefer a step-by-step PR?

@seanlaw
Contributor

seanlaw commented Nov 21, 2022

> Now, by using func1 and func2, we will have:

Great! I don't love rolling_update_p_norm as it isn't quite consistent with the other "rolling" functions that we've named.

> I think this is similar to the updating process in _aamp, and thus we can use func2 to speed up the process when p==1 or p==2.

But func2 targets CPUs and so it cannot be used/called in _compute_and_update_PI_kernel, which targets GPUs

> Please let me know what you think.

Hmmm, now that I think about it, let's put this one on pause for now until we've wrapped up all of the other more important stuff like fixing precision issues. I'm not sure that this is an immediate "value add" for users while many of the other stability-related things are a definite value-add!

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Nov 21, 2022

> I don't love rolling_update_p_norm as it isn't quite consistent with the other "rolling" functions that we've named.

Yeah...I know! I tried (a little bit) to find a better name but couldn't.

> But func2 targets CPUs and so it cannot be used/called in _compute_and_update_PI_kernel, which targets GPUs

Right! The code is simple though, and it should be easy to write it as a device function (but we would then be repeating ourselves).
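
For instance, a CUDA device-function counterpart might look like this (a hypothetical sketch; the name and placement are assumptions, and `old`/`new` would be scalars inside the kernel):

from numba import cuda


@cuda.jit(device=True)
def _gpu_rolling_update_p_norm(p_norm, old, new, p):
    # drop the contribution of `old` and add the contribution of `new`
    if p == 2.0:
        return p_norm - old * old + new * new
    else:
        return p_norm - abs(old) ** p + abs(new) ** p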

> Hmmm, now that I think about it, let's put this one on pause for now until we've wrapped up all of the other more important stuff like fixing precision issues. I'm not sure that this is an immediate "value add" for users while many of the other stability-related things are a definite value-add!

Sure :)

The thing that caught my attention was that the computing time for aamp(..., p=2.0) was way higher than stump, and that did not make sense to me. Rolling cov doesn't seem to involve fewer operations than rolling p_norm (when p==2.0), so their running times should have been very close to each other.


I will leave it here, and will try to resume recording the computing time for top-k (normalized=False).

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Nov 21, 2022

@seanlaw
I know you told me to leave this for now... but I think you will find this VERY interesting!!!
(I do apologize for doing this last investigation 😄)

I realized that there is a huge difference in the computing time between p==2.0 and p==2 !!!

from numba import njit


@njit(fastmath=True)
def my_func(a, p):
    n = len(a)
    x = 0
    for i in range(n):
        for j in range(n):
            x = x + (a[i] ** p) + (a[j] ** p)
    
    return x

And, the input:

seed = 0
np.random.seed(seed)
# n = 1000, 5000, 10000, 20000
a = np.random.rand(n)
| | n=1000 | n=5000 | n=10000 | n=20000 |
| --- | --- | --- | --- | --- |
| case1, p=2.0 | 0.036 | 0.77 | 3.03 | 12.05 |
| case2, p=2 | 0.008 | 0.14 | 0.52 | 2.04 |
| improvement | 78% | 82% | 83% | 83% |
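
For reference, a minimal timing harness that could produce numbers like these might look as follows (an assumed setup using my_func from above, not necessarily the exact script used; note that each signature of my_func is compiled separately, so both are warmed up first):

import time

import numpy as np

np.random.seed(0)
n = 20_000
a = np.random.rand(n)

my_func(a, 2.0)  # warm up JIT compilation for the float-exponent signature
my_func(a, 2)    # warm up JIT compilation for the int-exponent signature

start = time.time()
my_func(a, 2.0)
print("p=2.0 (float):", time.time() - start)

start = time.time()
my_func(a, 2)
print("p=2 (int):", time.time() - start)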

I am not saying this is the only place where the code can be optimized, but it seems to be a very important one!

Maybe we can do something similar to core.check_window_size(m, ...):

# in core.py
def check_order(p):
    if int(p) == p:
        return int(p)
    else:
        return p

| | p=1 | p=2 | p=3 | p=4 |
| --- | --- | --- | --- | --- |
| p: float, n=20000 | 1.43 | 12.04 | 12.02 | 12.13 |
| p: int, n=20000 | 1.43 | 2.05 | 2.06 | 2.8 |
| improvement | 0% | 83% | 83% | 77% |

Btw, I posted a question on numba discourse:
https://numba.discourse.group/t/strange-behavior-in-the-performance-of-operator-p/1658

seanlaw added the enhancement label Nov 21, 2022
@seanlaw
Contributor

seanlaw commented Nov 21, 2022

@NimaSarajpoor That is interesting! It seems that raising a float to the power of another float is more computationally expensive than raising it to the power of an int.

@JaKasb

JaKasb commented Nov 21, 2022

If p is an integer, the compiler can unroll the loop inside pow(x, 2) to x*x.
If p is a float, the compiler emits code that is valid for all possible float values, thereby not allowing many optimizations.

@seanlaw
Contributor

seanlaw commented Nov 21, 2022

> If p is an integer, the compiler can unroll the loop inside pow(x, 2) to x*x.
> If p is a float, the compiler emits code that is valid for all possible float values, thereby not allowing many optimizations.

Very insightful! Thanks for sharing @JaKasb!

@seanlaw
Contributor

seanlaw commented Nov 21, 2022

> Maybe we can do something similar to core.check_window_size(m, ...):

@NimaSarajpoor I think we may be able to use a built-in Python function:

def _check_order(p):
    if float(p).is_integer():
        return int(p)
    else:
        return p
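
Usage would presumably look something like this (a sketch; where exactly the check would be applied is still to be decided):

p = _check_order(2.0)  # -> 2 (int), so x ** p can be specialized to x * x
p = _check_order(1.5)  # -> 1.5 (stays a float)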
