Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Top_k_motifs : Returns (partially) wrong neighbors #12

Open
tcrasset opened this issue Feb 20, 2020 · 1 comment
Open

Top_k_motifs : Returns (partially) wrong neighbors #12

tcrasset opened this issue Feb 20, 2020 · 1 comment

Comments

@tcrasset
Copy link

Description

I didn't do anything per se, I rather implemented this algorithm in Java and I have some useful tweaks.

What I Observed

The implemented algorithm uses the numpy.argpartition() method.
As the documentation of this method states :

The k-th element will be in its final sorted position and all smaller elements will be moved before it
and all larger elements behind it. The order all elements in the partitions is undefined.

Concretely, we sort the distance_profile and we get the indices using
indices = np.argpartition(tmp, k)

However in the for loop, we go through every index, so it could be that the first k indices are trivial matches or get excluded in the subsequent iterations and the "motifs" that get picked as nearest neighbors are motifs that are in indices[k:] and thus the order of them is undefined i.e. unsorted. We thus end up with neighbors that are not really neighbors.
One fix for this is to just loop over indices[0:k] and notify the user that less that k neighbors were found if some are excluded.

Moreover, I think one should apply the exclusion zone after one adds an index to the found indices, so putting the code inside of the if condition.
Let's assume we have

exclusion_zone = 2
k = 5
indices = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 ,15]
distance_profile = [Inf, Inf, Inf, x, x, x, ...]

The for loop will go through the indices array, see that index 1 is excluded, and then exclude in again. All good until now.
However, once the loop gets to index 2, as the exclusion zone gets applied on either side of the index, index 4 gets excluded.
Arriving at index 3, it is excluded, and thus is excludes it again and index 5 is then put into the exclusion zone.
This goes on until all indices get consumed and the found indices remains empty, whereas it should have returned at least the index 4 as a neighbor.
I know this is an artificial example, I didn't test it using your library. I didn't read the calling code, but this could be a non-issue if the query is already excluded from the distance profile (I think, but I am not sure). However, the easiest way would be to put the exclusion zone inside of the if statement.

@RexKing6
Copy link

I think top k here is not precise because of np.argpartition(). It may be replaced with np.sort() or other functions to get the precise neighbors. My idea is that we should check if top-2 in exclusion zone after appending top-1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants