-
Notifications
You must be signed in to change notification settings - Fork 491
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Crash when allow_single_cluster used with cluster_selection_epsilon #370
Comments
I'm also experiencing this issue. only happens when using In ._hdbscan_tree.pyx the cluster_tree is empty because all of the children of the root node have child size == 1. 'leaf' method can get away with this because 'leaves' is set to empty as well. However, 'eom_clusters' contains the root node and epsilon_search() relies on cluster_tree only searching child nodes. I think we want to disallow the root node from being in the 'eom_clusters' list? |
@danielzgtg please see my pull request for a possible fix to your problem. I am not positive that it does not affect the underlying algorithm, but I did my best to find a solution and I'm pretty confident. |
fixes scikit-learn-contrib#370 original fix for 370 was not fully tested
fixes scikit-learn-contrib#370 original fix for 370 was not fully tested
So after reading the HDBSCAN* paper by Malzer and Baum, I think that probably my fix for 370 was not algorithmically correct; the fix allows us to run with that parameter, but does not actually allow the algorithm to return exactly one single cluster. The description under section 4.3 states that describes the merging based on epsilon "If we find one before reaching the root, we select it as a cluster and unselect all of its descendants". @lmcinnes or @cmalzer would you be able to confirm that for the combination of "allow_single_cluster" and "cluster_selection_epsilon" that we want to allow reaching the root node during the upwards traversal of the hierarchy? Or if that has unintended consequences? This seems like the correct move. I have made a new commit that I think implements it correctly, and fixes some incorrect behavior I was seeing. I apologize for not doing more thorough research before making my initial commit. |
Hi, thanks for pointing that out - I forgot to implement cluster_selection_epsilon for the "allow_single_cluster" option. (The algorithm described in the paper only considered the case where we want to partition the data set) But I had a look at the latest commit and returning the root in "traverse_upwards" if allow_single_cluster is set to True should indeed fix the issue. Thank you! |
Thanks for the quick reply and for your work in designing this algorithm! |
@lmcinnes Unfortunately, this issue is not resolved. The second pull request I made passed a unit test that I was using and allows single clusters in most cases, but there is an edge case with the code provided by @danielzgtg. I tried to do some debugging and the problem lies appears to be when |
I just had a look at it. The problem is indeed that cluster_tree is empty at this point. Therefore, cluster_tree['lambda_val'][cluster_tree['child'] == leaf] is an empty array and you will get an IndexError when trying to access the first element ([0]). The reason why cluster_tree is empty is because of the line "cluster_tree = tree[tree['child_size'] > 1]". The tree only contains a parent node with children of size 1. The code works for cluster_selection_method="leaf" because the list of leaves is empty. For "eom", the parent cluster is passed as "leaf". This could be solved by checking whether cluster_tree is empty before doing the epsilon_search. For example, "if cluster_selection_epsilon != 0.0 and cluster_tree.shape[0] > 0: ... " |
I just tested it with your suggestion of How about this?
Honestly I'm a little confused. When we want to allow single cluster, does that mean we want to allow it to be just the noise cluster? In that case it would be :
|
Seems like I tested with the wrong number of random data points, and then got the same error, but with a different cause.
(Sorry for the many edits) |
Thanks for the fast response! That sounds fine. I'll put in a pull request tonight. |
…t node in leaf list
I put up a pull request with your suggestion. The reason I had the check inside the If we can't make that assumption then I will change the conditional to
because if the root is present at all in the list and we're allow it as an output then it will by default become the only epsilon child returned from the search. |
If HDBSCAN selects a cluster, then it also unselects all of its descendants. Therefore, it actually should not happen that the root cluster AND other clusters are in the list. However, your suggestion of "if .... in eom_clusters" is much easier to read than my version, so I would actually prefer that line of code - although it might be a little bit faster to check the length first because otherwise we search the whole list of clusters for the root node every time we use the epsilon parameter, even though that is a rare case. But I'm not sure how much of a performance difference that makes... (maybe checking the length takes just as long anyway?) I'm fine with either way. |
Thanks for clarifying! I think maybe as a middle ground I'll leave it as is for performance and put a comment above. |
…t node in leaf list
fixes #370 -- epsilon search would crash when root node in leaf list
I ran into this same issue using the released 0.8.26 (pip installed). Any plan to release another version with the fixes? The release hasn't been updated for a while now. |
I am experiencing the problem with allow_single_cluster=True, it suppose to return single cluster but I am always getting 3 clusters, -1 for noise , 0 for cluster1 and 1 cluster 2, using hdbscan 0.8.29. Any idea why this is happening? |
When I use
cluster_selection_epsilon=x
where x > 0 andallow_single_cluster=True
together, HDBSCAN crashes.I am using those two options together to try and get the
no_structure
toy dataset (bottom row, square) clustered properly. I want the square to be completely blue like how DBSCAN does it. When I only usecluster_selection_epsilon
, multiple clusters appear in that square. When I only useallow_single_cluster=True
, part of that square is grey. I think I can only get the desired result using both of those arguments, but HDBSCAN crashes when I do that.Code
Expected behavior
HDBSCAN to cluster the data without crashing. Preferably, painting all points in the square blue as described.
Actual behavior
HDBSCAN crashes with the following traceback:
The text was updated successfully, but these errors were encountered: