Computing matrix profile while considering a custom transformation for subsequences #942

NimaSarajpoor · 2024-01-07T04:53:29Z

Currently, STUMPY supports the parameter normalize which allows users to compute matrix profile for the following cases:

(1) normalize == False: Compute the distance between subsequences with no transformation
(2) normalize == True: z-normalize subsequences before computing the (Euclidean) distance

There have been a few interest in using a different transformation:

u-normalized version of the Matrix Profile #900: Requested a specific transformation, where the mean of each subsequence is being subtracted from it but there is no scaling
Centering rather than normalizing subsequences duing motif discovery #940: Requested another transformation, where the first element of each subsequence is being subtracted from it.

Usually, when the volume of data is small, it is better to just get all the subsequences, do the transformation, and then compute the full distance matrix (see discussion in #900). But, what if the volume of the data is large? In such case, having an efficient approach to compute the distance considering the custom transformation can be useful.

The text was updated successfully, but these errors were encountered:

NimaSarajpoor · 2024-01-07T05:12:14Z

The two examples above can be covered by the general case below:
For a given offset array alpha, transform the i-th subsequence, T[i:i+m], to T[i:i+m] - alpha[i], and compute the matrix profile accordingly.

If our distance function is Euclidean Distance (i.e. p==2 in p-norm distance), then I think we can revise the code to achieve this in efficient way, and preserve code's maintainability as well.

Let's say we have two subsequences $S_{i}$ and $S_{j}$, each with length m.

$S_{i} = [T_{i,0}, T_{i,1}, ..., T_{i,m-1}]$

$S_{j} = [T_{j,0}, T_{j,1}, ..., T_{j,m-1}]$

And, let's say we want to just apply some offset to them before computing their distance,
e.g applying $\alpha_{i}$ offset to $S_{i}$ and applying $\alpha_{j}$ offset to $S_{j}$

$x = dist(S_{i}, S_{j})$

$y = dist(S_{i} -\alpha_{i}, S_{j} -\alpha_{j})$

where dist is the Euclidean distance function. $x$ is the distance with no transformation, and $y$ is the distance we are looking for (i.e. the distance after transformation)

$y ^ 2 = \sum_{s=0:m}{[(T_{i,s} - \alpha_{i}) - (T_{j,s} -\alpha_{j})]^{2}}$

$y ^ 2 = \sum_{s=0:m}{[(T_{i,s} - T_{j,s}) - (\alpha_{i} -\alpha_{j})]^{2}}$

$y ^ 2 = \sum_{s=0:m}{(T_{i,s} - T_{j,s})^2} + \sum_{s=0:m}{(\alpha_{i} - \alpha_{j})^2} - \sum_{s=0:m}{2(\alpha_{i}- \alpha_{j})(T_{i,s} - T_{j,s})}$

Note that the firs term is just $x^{2}$, and the second term is $m (\alpha_{i} - \alpha_{j})^2$. Therefore:

$y ^ 2 = x ^ 2 + m (\alpha_{i} - \alpha_{j})^2 - \sum_{s=0:m}{2(\alpha_{i}- \alpha_{j})(T_{i,s} - T_{j,s})}$

$y ^ 2 = x ^ 2 + m (\alpha_{i} - \alpha_{j})^2 - 2(\alpha_{i}- \alpha_{j})\sum_{s=0:m}{(T_{i,s} - T_{j,s})}$

$y ^ 2 = x^2 + m (\alpha_{i} - \alpha_{j})^2 - 2(\alpha_{i}- \alpha_{j})(\sum_{s=0:m}{T_{i,s}} - \sum_{s=0:m}{T_{j,s}})$

$y ^ 2 = x ^2 + m (\alpha_{i} - \alpha_{j})^2 - 2(\alpha_{i}- \alpha_{j})(m\mu_{i} - m\mu_{j})$

$y ^ 2 = x^2 + m (\alpha_{i} - \alpha_{j})^2 - 2m(\alpha_{i} - \alpha_{j})(\mu_{i} - \mu_{j})$

Let alpha be an array of size len(T) - m + 1, where alpha[i] is the offset we want to apply to the i-th subsequence with length m. Then, we need to change this block of code

stumpy/stumpy/aamp.py

Lines 121 to 140 in 3559b38

    
           if uint64_i == 0 or uint64_j == 0: 
        
               p_norm = ( 
        
                   np.linalg.norm( 
        
                       T_B[uint64_j : uint64_j + uint64_m] 
        
                       - T_A[uint64_i : uint64_i + uint64_m], 
        
                       ord=p, 
        
                   ) 
        
                   ** p 
        
               ) 
        
           else: 
        
               p_norm = np.abs( 
        
                   p_norm 
        
                   - np.absolute(T_B[uint64_j - uint64_1] - T_A[uint64_i - uint64_1]) 
        
                   ** p 
        
                   + np.absolute( 
        
                       T_B[uint64_j + uint64_m - uint64_1] 
        
                       - T_A[uint64_i + uint64_m - uint64_1] 
        
                   ) 
        
                   ** p 
        
               )

to this:

    if uint64_i == 0 or uint64_j == 0:
        # KEEP Existing Code
        p_norm = (
            np.linalg.norm(
                T_B[uint64_j : uint64_j + uint64_m]
                - T_A[uint64_i : uint64_i + uint64_m],
                ord=p,
            )
            ** p
        )

        # [ADD New Code] POST-PROCESS, ONLY when p == 2, and there is user-defined offset 
        p_norm = (
            p_norm  
            + m * (alpha[uint64_i] - alpha[uint64_j]) ** 2  
            - 2 * m * (alpha[uint64_i] - alpha[uint64_j]) * (μ_A[uint64_i] -μ_B[uint64_j])
        )

    else:
        # Let's call the current p-norm distance `y`. It is the Euclidean distance after applying the "offset" transformation
        # and `x` is the p-norm distance with no transformation
         
        # Use the current y to compute its corresponding x (i.e. reverting the post-process in previous iteration) 
        # Then compute new x.
        # Finally, compute new y.

        # [ADD New Code] PRE-PROCESS, ONLY when p == 2, and there is user-defined offset 
        p_norm = (
            p_norm  
            - m * (alpha[uint64_i - uint64_1] - alpha[uint64_j - uint64_1]) ** 2  
            + 2 * m * (alpha[uint64_i - uint64_1] - alpha[uint64_j - uint64_1]) * (μ_A[uint64_i - uint64_1] -μ_B[uint64_j - uint64_1])
        )
       
        # KEEP Existing Code
        p_norm = np.abs(
            p_norm
            - np.absolute(T_B[uint64_j - uint64_1] - T_A[uint64_i - uint64_1])
            ** p
            + np.absolute(
                T_B[uint64_j + uint64_m - uint64_1]
                - T_A[uint64_i + uint64_m - uint64_1]
            )
            ** p
        )

        # [ADD New Code] POST-PROCESS, ONLY when p == 2, and there is user-defined offset 
        p_norm = (
            p_norm  
            + m * (alpha[uint64_i] - alpha[uint64_j]) ** 2  
            - 2 * m * (alpha[uint64_i] - alpha[uint64_j]) * (μ_A[uint64_i] -μ_B[uint64_j])
       )

seanlaw · 2024-01-07T18:13:55Z

Considering that we are only adding either a (negative) constant or some pre-determined set of values for each subsequence, I think we should consider new parameters called T_A_add and T_B_add (and possibly extend to T_A_divide and T_B_divide where possible in the future). "Transformer" sounds too vague/non-specific/open-ended. Doing X_add purposely limits what is possible. What do you think @NimaSarajpoor?

NimaSarajpoor · 2024-01-07T18:26:45Z

@seanlaw

"Transformer" sounds too vague/non-specific/open-ended.

Definitely right! 😄 T_A_add (T_B_add) is a better choice by far!

seanlaw · 2024-01-07T19:09:31Z

We will also need to consider what this means for the multi-dimensional case and it's affect on the MDL (minimum description length)

NimaSarajpoor · 2024-01-11T03:27:49Z

We will also need to consider what this means for the multi-dimensional case and it's affect on the MDL (minimum description length)

I tried to go through the source code. I realized that, in non-normalized MDL, we obtain the global min and max, and then we convert the subsequences as follows: (subsequence - min) / (max - min).

I think that we might be able to use the same approach after applying T_A_add to T_A. So, our max can be the max of rolling max of subsequeces, and our min will be the min of rolling min. (I think, at the end of the day, we want to apply the same transform to subsequence when we want to compute MDL).

But, how to test this proposal?
(1) We try to find a problem and then use the existing code aamp_mmotifs.py. Then, let's say min = 0, and max = 1. We can try a new min that is lower than min, and a new max that is larger than max. This can help us understand if the main purpose of this transformation is to make sure that (i) final values are being limited to a certain range, and (ii) we are using the same transform for all subsequences.

(2) We insert two similar patterns but with the same average, say 0, to a randomly-generated time series data, and then we try to see if we can detect It using the existing code. Then, if it works, we can move up / down one of the motifs, and try to capture those again using T_A_add.

One thing that is certain is that MDL is complicated 😄

seanlaw · 2024-01-12T00:26:09Z

I tried to go through the source code. I realized that, in non-normalized MDL, we obtain the global min and max, and then we convert the subsequences as follows: (subsequence - min) / (max - min).

Where exactly are you seeing this? I'm looking at maamp.mdl and core._mdl and I'm not seeing (subsequence - min) / (max - min)

NimaSarajpoor · 2024-01-12T01:26:37Z

In aamp_mmotifs.py, there is a call to the function maap_mdl. We can see the following block of code in maap_mdl.

stumpy/stumpy/maamp.py

Lines 275 to 281 in 3559b38

    
           if discretize_func is None: 
        
               T_isfinite = np.isfinite(T) 
        
               T_min = T[T_isfinite].min() 
        
               T_max = T[T_isfinite].max() 
        
               discretize_func = partial( 
        
                   _maamp_discretize, a_min=T_min, a_max=T_max, n_bit=n_bit 
        
               )

And the function _maamp_discretize is:

stumpy/stumpy/maamp.py

Lines 165 to 195 in 3559b38

    
           def _maamp_discretize(a, a_min, a_max, n_bit=8):  # pragma: no cover 
        
               """ 
        
               Discretize each row of the input array 
        
               This distribution is best suited for non-normalized time seris data 
        
               Parameters 
        
               ---------- 
        
               a : numpy.ndarray 
        
                   The input array 
        
               a_min : float 
        
                   The minimum value 
        
               a_max : float 
        
                   The maximum value 
        
               n_bit : int, default 8 
        
                   The number of bits to use for computing the bit size 
        
               Returns 
        
               ------- 
        
               out : numpy.ndarray 
        
                   Discretized array 
        
               """ 
        
               return ( 
        
                   np.round(((a - a_min) / (a_max - a_min)) * ((2**n_bit) - 1.0)).astype( 
        
                       np.int64 
        
                   ) 
        
                   + 1 
        
               )

And then, in maap_mdl, we have:

stumpy/stumpy/maamp.py

Lines 293 to 296 in 3559b38

    
           disc_subseqs = discretize_func(subseqs) 
        
           disc_neighbors = discretize_func(neighbors) 
        
           D = np.linalg.norm(disc_subseqs - disc_neighbors, axis=1, ord=p)

seanlaw · 2024-01-12T01:37:31Z

I think that we might be able to use the same approach after applying T_A_add to T_A. So, our max can be the max of rolling max of subsequeces, and our min will be the min of rolling min. (I think, at the end of the day, we want to apply the same transform to subsequence when we want to compute MDL).

I have no idea. I can't remember why we used Min Max Scaling for the discretization :(

This distribution is best suited for non-normalized time seris data

I'm not sure where this statement came from 😆

NimaSarajpoor · 2024-01-12T02:21:30Z

@seanlaw

I have no idea. I can't remember why we used Min Max Scaling for the discretization :(

I will do some research to see if I can find something.

I'm not sure where this statement came from

And, this is from you who is very careful. And, now I can better see why you try to be very careful in things that need to be maintained later. Even with that level of due diligence, there are things that may go slightly wrong (By "wrong", I do not mean incorrect. I mean it may lose its clarity).

seanlaw · 2024-01-12T02:34:05Z

I remember now! In one of the other MDL comments, I had asked the question:

What happens if the subsequence were not z-normalized at all?

Michael Yeh actually responded:

As split_pt stores z-scores, the discretization function won't work if the time series is not z-normalized.

I would suggest using the min max discretization function for time series without normalization with the min and max estimated from the whole time series (instead of the subsequence). It may require further exploration to determine the ideal discretization function under different scenarios.

Then, I later discovered:

For non-normalized time series, It looks like I can apply Definition 3 from this paper to discretize the time series

So, at least it is based on published work. I still don't feel comfortable discretizing a subsequence that is transformed by T_add... as I don't know the motivation for the discretization function and what makes a "good" discretization function

NimaSarajpoor mentioned this issue Jan 16, 2024

Some issues in Tutorial_Multidimensional_Motif_Discovery and MDL #945

Closed

seanlaw added the enhancement New feature or request label Feb 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Computing matrix profile while considering a custom transformation for subsequences #942

Computing matrix profile while considering a custom transformation for subsequences #942

NimaSarajpoor commented Jan 7, 2024 •

edited

NimaSarajpoor commented Jan 7, 2024 •

edited

seanlaw commented Jan 7, 2024

NimaSarajpoor commented Jan 7, 2024

seanlaw commented Jan 7, 2024

NimaSarajpoor commented Jan 11, 2024

seanlaw commented Jan 12, 2024

NimaSarajpoor commented Jan 12, 2024 •

edited

seanlaw commented Jan 12, 2024

NimaSarajpoor commented Jan 12, 2024

seanlaw commented Jan 12, 2024 •

edited

Computing matrix profile while considering a custom transformation for subsequences #942

Computing matrix profile while considering a custom transformation for subsequences #942

Comments

NimaSarajpoor commented Jan 7, 2024 • edited

NimaSarajpoor commented Jan 7, 2024 • edited

seanlaw commented Jan 7, 2024

NimaSarajpoor commented Jan 7, 2024

seanlaw commented Jan 7, 2024

NimaSarajpoor commented Jan 11, 2024

seanlaw commented Jan 12, 2024

NimaSarajpoor commented Jan 12, 2024 • edited

seanlaw commented Jan 12, 2024

NimaSarajpoor commented Jan 12, 2024

seanlaw commented Jan 12, 2024 • edited

NimaSarajpoor commented Jan 7, 2024 •

edited

NimaSarajpoor commented Jan 7, 2024 •

edited

NimaSarajpoor commented Jan 12, 2024 •

edited

seanlaw commented Jan 12, 2024 •

edited