Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speeding up scrump #643

Open
NimaSarajpoor opened this issue Jul 15, 2022 · 0 comments
Open

Speeding up scrump #643

NimaSarajpoor opened this issue Jul 15, 2022 · 0 comments
Labels
enhancement New feature or request question Further information is requested

Comments

@NimaSarajpoor
Copy link
Collaborator

NimaSarajpoor commented Jul 15, 2022

I have an idea that might speed up scrump when its parameter pre_scrump is set to True. (After some thought, I concluded that this reduction in computing time would be probably insignificant. However, I am going to share my idea here anyway)

When pre_scrump=True, we calculate an approximate matrix profile by going through some samples (indices):

indices = np.random.permutation(range(0, l, s)).astype(np.int64)

(note: l = n - m + 1)

And, for each sample i:

  • we calculate its distance profile.
  • we also use QT to calculate the distance between S_(i + j) and S_(nn_i + j) for j in range(1, s)
  • we also use QT to calculate the distance between S_(i - j) and S_(nn_i - j) for j in range(1, s)

So, of Combination(l, 2) cells of distance matrix (in self-join), we roughly calculate:

  • $\frac{l}{s} \times l$ (regarding distance profile)
  • $\frac{l}{s} \times s$ (regarding move forward)
  • $\frac{l}{s} \times s$ (regarding move backward)

So, at the end of prescrump process, we already calculated roughly $\frac{1}{s} l^{2} + 2l$ elements of distance matrix. I think if we assume l is large, we can then say:

  • we need to calculate $\frac{1}{2} l^{2}$ cells to obtain the full distance matrix (in self-join)
  • At the end of prescrump, we already calculated $\frac{1}{s} l^{2}$ cells of them.

So, if we avoid re-calculating $\frac{1}{s} l^{2}$ number of distances throughtout .update() process, then:

$$ \frac{\frac{1}{s} l^{2}}{\frac{1}{2} l^{2}} = \frac{2}{s}$$

So, if l = 100_000 and s = 100 (0.001 of l), we are reducing the computation time by 2% (?!). This is not just for a single update but rather for several calls of update that complete the distance matrix. So, I think the improvement is probably insignificant for a single call of update method.

I am not sure about my rough calculation though :)


I think keeping track of what pair-wise distances have been calculated can also help us avoid the duplicate issue that might appear in the top-k matrix profiles as discussed in PR #595.

@seanlaw seanlaw added enhancement New feature or request question Further information is requested labels Jul 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants