Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash when allow_single_cluster used with cluster_selection_epsilon #370

Closed
danielzgtg opened this issue Apr 21, 2020 · 15 comments · Fixed by #385
Closed

Crash when allow_single_cluster used with cluster_selection_epsilon #370

danielzgtg opened this issue Apr 21, 2020 · 15 comments · Fixed by #385

Comments

@danielzgtg
Copy link

When I use cluster_selection_epsilon=x where x > 0 and allow_single_cluster=True together, HDBSCAN crashes.

I am using those two options together to try and get the no_structure toy dataset (bottom row, square) clustered properly. I want the square to be completely blue like how DBSCAN does it. When I only use cluster_selection_epsilon, multiple clusters appear in that square. When I only use allow_single_cluster=True, part of that square is grey. I think I can only get the desired result using both of those arguments, but HDBSCAN crashes when I do that.

Code

import numpy as np
import hdbscan

if __name__ == '__main__':
    no_structure = np.random.rand(1500, 2)
    clusterer = hdbscan.HDBSCAN(min_cluster_size=15, cluster_selection_epsilon=3, allow_single_cluster=True)
    clusterer.fit(no_structure)

Expected behavior

HDBSCAN to cluster the data without crashing. Preferably, painting all points in the square blue as described.

Actual behavior

HDBSCAN crashes with the following traceback:

Traceback (most recent call last):
  File "/home/home/PycharmProjects/sandbox/crash_example.py", line 7, in <module>
    clusterer.fit(no_structure)
  File "/home/home/PycharmProjects/sandbox/venv/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 919, in fit
    self._min_spanning_tree) = hdbscan(X, **kwargs)
  File "/home/home/PycharmProjects/sandbox/venv/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 632, in hdbscan
    return _tree_to_labels(X,
  File "/home/home/PycharmProjects/sandbox/venv/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 59, in _tree_to_labels
    labels, probabilities, stabilities = get_clusters(condensed_tree,
  File "hdbscan/_hdbscan_tree.pyx", line 645, in hdbscan._hdbscan_tree.get_clusters
  File "hdbscan/_hdbscan_tree.pyx", line 733, in hdbscan._hdbscan_tree.get_clusters
  File "hdbscan/_hdbscan_tree.pyx", line 631, in hdbscan._hdbscan_tree.epsilon_search
IndexError: index 0 is out of bounds for axis 0 with size 0
@neontty
Copy link
Collaborator

neontty commented Apr 24, 2020

I'm also experiencing this issue. only happens when using cluster_selection_method='eom' .

In ._hdbscan_tree.pyx the cluster_tree is empty because all of the children of the root node have child size == 1. 'leaf' method can get away with this because 'leaves' is set to empty as well.

However, 'eom_clusters' contains the root node and epsilon_search() relies on cluster_tree only searching child nodes. I think we want to disallow the root node from being in the 'eom_clusters' list?

@neontty
Copy link
Collaborator

neontty commented Apr 24, 2020

@danielzgtg please see my pull request for a possible fix to your problem. I am not positive that it does not affect the underlying algorithm, but I did my best to find a solution and I'm pretty confident.

@neontty neontty mentioned this issue Apr 27, 2020
neontty added a commit to neontty/hdbscan that referenced this issue May 26, 2020
fixes scikit-learn-contrib#370
original fix for 370 was not fully tested
neontty added a commit to neontty/hdbscan that referenced this issue May 29, 2020
fixes scikit-learn-contrib#370
original fix for 370 was not fully tested
@neontty
Copy link
Collaborator

neontty commented May 29, 2020

So after reading the HDBSCAN* paper by Malzer and Baum, I think that probably my fix for 370 was not algorithmically correct; the fix allows us to run with that parameter, but does not actually allow the algorithm to return exactly one single cluster.

The description under section 4.3 states that describes the merging based on epsilon "If we find one before reaching the root, we select it as a cluster and unselect all of its descendants".

@lmcinnes or @cmalzer would you be able to confirm that for the combination of "allow_single_cluster" and "cluster_selection_epsilon" that we want to allow reaching the root node during the upwards traversal of the hierarchy? Or if that has unintended consequences? This seems like the correct move.

I have made a new commit that I think implements it correctly, and fixes some incorrect behavior I was seeing. I apologize for not doing more thorough research before making my initial commit.

@neontty neontty mentioned this issue May 29, 2020
@cmalzer
Copy link
Collaborator

cmalzer commented May 29, 2020

Hi, thanks for pointing that out - I forgot to implement cluster_selection_epsilon for the "allow_single_cluster" option. (The algorithm described in the paper only considered the case where we want to partition the data set)

But I had a look at the latest commit and returning the root in "traverse_upwards" if allow_single_cluster is set to True should indeed fix the issue. Thank you!

@neontty
Copy link
Collaborator

neontty commented May 29, 2020

Thanks for the quick reply and for your work in designing this algorithm!

@neontty
Copy link
Collaborator

neontty commented Jun 23, 2020

@lmcinnes Unfortunately, this issue is not resolved.

The second pull request I made passed a unit test that I was using and allows single clusters in most cases, but there is an edge case with the code provided by @danielzgtg.

I tried to do some debugging and the problem lies appears to be when cluster_tree['lambda_val] is empty on _hdbscan_tree.pyx:638 . I'm having trouble debugging the cython though (still pretty new to cython). If anyone has time to help investigate this, I would greatly appreciate it.

@cmalzer
Copy link
Collaborator

cmalzer commented Jun 23, 2020

I just had a look at it. The problem is indeed that cluster_tree is empty at this point. Therefore, cluster_tree['lambda_val'][cluster_tree['child'] == leaf] is an empty array and you will get an IndexError when trying to access the first element ([0]).

The reason why cluster_tree is empty is because of the line "cluster_tree = tree[tree['child_size'] > 1]". The tree only contains a parent node with children of size 1.

The code works for cluster_selection_method="leaf" because the list of leaves is empty. For "eom", the parent cluster is passed as "leaf".

This could be solved by checking whether cluster_tree is empty before doing the epsilon_search. For example, "if cluster_selection_epsilon != 0.0 and cluster_tree.shape[0] > 0: ... "

@neontty
Copy link
Collaborator

neontty commented Jun 25, 2020

I just tested it with your suggestion of cluster_tree.shape[0] > 0 in the conditional , but that fails. After doing a little bit of printf debugging, the cluster_tree had ~26 elements in it, and eom_clusters has exactly one, which is the root node. So, when we evaluate cluster_tree['child'] == leaf on line 637, leaf is the root node and it matches no child entries.

How about this?

cpdef set epsilon_search(set leaves, np.ndarray cluster_tree, np.double_t cluster_sel\
ection_epsilon, np.intp_t allow_single_cluster):                                      
                                                                                      
    selected_clusters = list()                                                        
    processed = list()                                                                
    root_node = cluster_tree['parent'].min()                                          
                                                                                      
    for leaf in leaves:                                                               
        if leaf == root_node:                                                         
            if allow_single_cluster:                                                  
                selected_clusters.append(leaf)                                        
                break                                                                 
            continue                                                                  
        eps = 1/cluster_tree['lambda_val'][cluster_tree['child'] == leaf][0]          
        if eps < cluster_selection_epsilon:                                           
            if leaf not in processed:                                                 

Honestly I'm a little confused. When we want to allow single cluster, does that mean we want to allow it to be just the noise cluster? In that case it would be :

cpdef set epsilon_search(set leaves, np.ndarray cluster_tree, np.double_t cluster_sel\
ection_epsilon, np.intp_t allow_single_cluster):                                      
                                                                                      
    selected_clusters = list()                                                        
    processed = list()                                                                
    root_node = cluster_tree['parent'].min()                                          
                                                                                      
    for leaf in leaves:                                                               
        if leaf == root_node:                                                         
            if not allow_single_cluster:                                              
                selected_clusters.append(leaf)                                        
            break                                                                     
        eps = 1/cluster_tree['lambda_val'][cluster_tree['child'] == leaf][0]          
        if eps < cluster_selection_epsilon:  

@cmalzer
Copy link
Collaborator

cmalzer commented Jun 25, 2020

Seems like I tested with the wrong number of random data points, and then got the same error, but with a different cause.
There are several ways how this can be fixed; it might actually be a little bit faster to do the check outside of epsilon_search, for example something like

if cluster_selection_epsilon != 0.0 and cluster_tree.shape[0] > 0:

            eom_clusters = [c for c in is_cluster if is_cluster[c]]
            if not (len(eom_clusters) == 1 and eom_clusters[0] == cluster_tree['parent'].min()):
                selected_clusters = epsilon_search(set(eom_clusters), cluster_tree, cluster_selection_epsilon, allow_single_cluster)

(Sorry for the many edits)

@neontty
Copy link
Collaborator

neontty commented Jun 25, 2020

Thanks for the fast response! That sounds fine. I'll put in a pull request tonight.

@neontty
Copy link
Collaborator

neontty commented Jul 4, 2020

@cmalzer

I put up a pull request with your suggestion.

The reason I had the check inside the epsilon_search() method is because I was unsure whether or not we could guarantee that when the root node is present in eom_clusters that it is the only item in that list. Otherwise if we only check that one case, the root node will still cause an error in epsilon_search() if it is anywhere present in the eom_clusters list with a mix of other leaves.

If we can't make that assumption then I will change the conditional to

if (cluster_tree['parent'].min() in eom_clusters)

because if the root is present at all in the list and we're allow it as an output then it will by default become the only epsilon child returned from the search.

@cmalzer
Copy link
Collaborator

cmalzer commented Jul 5, 2020

If HDBSCAN selects a cluster, then it also unselects all of its descendants. Therefore, it actually should not happen that the root cluster AND other clusters are in the list. However, your suggestion of "if .... in eom_clusters" is much easier to read than my version, so I would actually prefer that line of code - although it might be a little bit faster to check the length first because otherwise we search the whole list of clusters for the root node every time we use the epsilon parameter, even though that is a rare case. But I'm not sure how much of a performance difference that makes... (maybe checking the length takes just as long anyway?) I'm fine with either way.

@neontty
Copy link
Collaborator

neontty commented Jul 7, 2020

Thanks for clarifying! I think maybe as a middle ground I'll leave it as is for performance and put a comment above.

neontty added a commit to neontty/hdbscan that referenced this issue Oct 9, 2020
neontty added a commit that referenced this issue Oct 10, 2020
fixes #370 -- epsilon search would crash when root node in leaf list
@jhwang7628
Copy link

I ran into this same issue using the released 0.8.26 (pip installed). Any plan to release another version with the fixes? The release hasn't been updated for a while now.

@nishi999
Copy link

nishi999 commented Apr 5, 2023

I am experiencing the problem with allow_single_cluster=True, it suppose to return single cluster but I am always getting 3 clusters, -1 for noise , 0 for cluster1 and 1 cluster 2, using hdbscan 0.8.29. Any idea why this is happening?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants