Crash when allow_single_cluster used with cluster_selection_epsilon #370

danielzgtg · 2020-04-21T00:05:05Z

When I use cluster_selection_epsilon=x where x > 0 and allow_single_cluster=True together, HDBSCAN crashes.

I am using those two options together to try and get the no_structure toy dataset (bottom row, square) clustered properly. I want the square to be completely blue like how DBSCAN does it. When I only use cluster_selection_epsilon, multiple clusters appear in that square. When I only use allow_single_cluster=True, part of that square is grey. I think I can only get the desired result using both of those arguments, but HDBSCAN crashes when I do that.

Code

import numpy as np
import hdbscan

if __name__ == '__main__':
    no_structure = np.random.rand(1500, 2)
    clusterer = hdbscan.HDBSCAN(min_cluster_size=15, cluster_selection_epsilon=3, allow_single_cluster=True)
    clusterer.fit(no_structure)

Expected behavior

HDBSCAN to cluster the data without crashing. Preferably, painting all points in the square blue as described.

Actual behavior

HDBSCAN crashes with the following traceback:

Traceback (most recent call last):
  File "/home/home/PycharmProjects/sandbox/crash_example.py", line 7, in <module>
    clusterer.fit(no_structure)
  File "/home/home/PycharmProjects/sandbox/venv/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 919, in fit
    self._min_spanning_tree) = hdbscan(X, **kwargs)
  File "/home/home/PycharmProjects/sandbox/venv/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 632, in hdbscan
    return _tree_to_labels(X,
  File "/home/home/PycharmProjects/sandbox/venv/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 59, in _tree_to_labels
    labels, probabilities, stabilities = get_clusters(condensed_tree,
  File "hdbscan/_hdbscan_tree.pyx", line 645, in hdbscan._hdbscan_tree.get_clusters
  File "hdbscan/_hdbscan_tree.pyx", line 733, in hdbscan._hdbscan_tree.get_clusters
  File "hdbscan/_hdbscan_tree.pyx", line 631, in hdbscan._hdbscan_tree.epsilon_search
IndexError: index 0 is out of bounds for axis 0 with size 0

The text was updated successfully, but these errors were encountered:

neontty · 2020-04-24T19:43:40Z

I'm also experiencing this issue. only happens when using cluster_selection_method='eom' .

In ._hdbscan_tree.pyx the cluster_tree is empty because all of the children of the root node have child size == 1. 'leaf' method can get away with this because 'leaves' is set to empty as well.

However, 'eom_clusters' contains the root node and epsilon_search() relies on cluster_tree only searching child nodes. I think we want to disallow the root node from being in the 'eom_clusters' list?

neontty · 2020-04-24T20:48:31Z

@danielzgtg please see my pull request for a possible fix to your problem. I am not positive that it does not affect the underlying algorithm, but I did my best to find a solution and I'm pretty confident.

fixes scikit-learn-contrib#370 original fix for 370 was not fully tested

neontty · 2020-05-29T04:29:44Z

So after reading the HDBSCAN* paper by Malzer and Baum, I think that probably my fix for 370 was not algorithmically correct; the fix allows us to run with that parameter, but does not actually allow the algorithm to return exactly one single cluster.

The description under section 4.3 states that describes the merging based on epsilon "If we find one before reaching the root, we select it as a cluster and unselect all of its descendants".

@lmcinnes or @cmalzer would you be able to confirm that for the combination of "allow_single_cluster" and "cluster_selection_epsilon" that we want to allow reaching the root node during the upwards traversal of the hierarchy? Or if that has unintended consequences? This seems like the correct move.

I have made a new commit that I think implements it correctly, and fixes some incorrect behavior I was seeing. I apologize for not doing more thorough research before making my initial commit.

cmalzer · 2020-05-29T07:02:17Z

Hi, thanks for pointing that out - I forgot to implement cluster_selection_epsilon for the "allow_single_cluster" option. (The algorithm described in the paper only considered the case where we want to partition the data set)

But I had a look at the latest commit and returning the root in "traverse_upwards" if allow_single_cluster is set to True should indeed fix the issue. Thank you!

neontty · 2020-05-29T13:38:30Z

Thanks for the quick reply and for your work in designing this algorithm!

neontty · 2020-06-23T03:28:42Z

@lmcinnes Unfortunately, this issue is not resolved.

The second pull request I made passed a unit test that I was using and allows single clusters in most cases, but there is an edge case with the code provided by @danielzgtg.

I tried to do some debugging and the problem lies appears to be when cluster_tree['lambda_val] is empty on _hdbscan_tree.pyx:638 . I'm having trouble debugging the cython though (still pretty new to cython). If anyone has time to help investigate this, I would greatly appreciate it.

cmalzer · 2020-06-23T08:18:21Z

I just had a look at it. The problem is indeed that cluster_tree is empty at this point. Therefore, cluster_tree['lambda_val'][cluster_tree['child'] == leaf] is an empty array and you will get an IndexError when trying to access the first element ([0]).

The reason why cluster_tree is empty is because of the line "cluster_tree = tree[tree['child_size'] > 1]". The tree only contains a parent node with children of size 1.

The code works for cluster_selection_method="leaf" because the list of leaves is empty. For "eom", the parent cluster is passed as "leaf".

This could be solved by checking whether cluster_tree is empty before doing the epsilon_search. For example, "if cluster_selection_epsilon != 0.0 and cluster_tree.shape[0] > 0: ... "

neontty · 2020-06-25T03:51:13Z

I just tested it with your suggestion of cluster_tree.shape[0] > 0 in the conditional , but that fails. After doing a little bit of printf debugging, the cluster_tree had ~26 elements in it, and eom_clusters has exactly one, which is the root node. So, when we evaluate cluster_tree['child'] == leaf on line 637, leaf is the root node and it matches no child entries.

How about this?

cpdef set epsilon_search(set leaves, np.ndarray cluster_tree, np.double_t cluster_sel\
ection_epsilon, np.intp_t allow_single_cluster):                                      
                                                                                      
    selected_clusters = list()                                                        
    processed = list()                                                                
    root_node = cluster_tree['parent'].min()                                          
                                                                                      
    for leaf in leaves:                                                               
        if leaf == root_node:                                                         
            if allow_single_cluster:                                                  
                selected_clusters.append(leaf)                                        
                break                                                                 
            continue                                                                  
        eps = 1/cluster_tree['lambda_val'][cluster_tree['child'] == leaf][0]          
        if eps < cluster_selection_epsilon:                                           
            if leaf not in processed:

Honestly I'm a little confused. When we want to allow single cluster, does that mean we want to allow it to be just the noise cluster? In that case it would be :

cpdef set epsilon_search(set leaves, np.ndarray cluster_tree, np.double_t cluster_sel\
ection_epsilon, np.intp_t allow_single_cluster):                                      
                                                                                      
    selected_clusters = list()                                                        
    processed = list()                                                                
    root_node = cluster_tree['parent'].min()                                          
                                                                                      
    for leaf in leaves:                                                               
        if leaf == root_node:                                                         
            if not allow_single_cluster:                                              
                selected_clusters.append(leaf)                                        
            break                                                                     
        eps = 1/cluster_tree['lambda_val'][cluster_tree['child'] == leaf][0]          
        if eps < cluster_selection_epsilon:

cmalzer · 2020-06-25T07:26:45Z

Seems like I tested with the wrong number of random data points, and then got the same error, but with a different cause.
There are several ways how this can be fixed; it might actually be a little bit faster to do the check outside of epsilon_search, for example something like

if cluster_selection_epsilon != 0.0 and cluster_tree.shape[0] > 0:

            eom_clusters = [c for c in is_cluster if is_cluster[c]]
            if not (len(eom_clusters) == 1 and eom_clusters[0] == cluster_tree['parent'].min()):
                selected_clusters = epsilon_search(set(eom_clusters), cluster_tree, cluster_selection_epsilon, allow_single_cluster)

(Sorry for the many edits)

neontty · 2020-06-25T13:21:02Z

Thanks for the fast response! That sounds fine. I'll put in a pull request tonight.

…t node in leaf list

neontty · 2020-07-04T17:52:02Z

@cmalzer

I put up a pull request with your suggestion.

The reason I had the check inside the epsilon_search() method is because I was unsure whether or not we could guarantee that when the root node is present in eom_clusters that it is the only item in that list. Otherwise if we only check that one case, the root node will still cause an error in epsilon_search() if it is anywhere present in the eom_clusters list with a mix of other leaves.

If we can't make that assumption then I will change the conditional to

if (cluster_tree['parent'].min() in eom_clusters)

because if the root is present at all in the list and we're allow it as an output then it will by default become the only epsilon child returned from the search.

cmalzer · 2020-07-05T08:52:33Z

If HDBSCAN selects a cluster, then it also unselects all of its descendants. Therefore, it actually should not happen that the root cluster AND other clusters are in the list. However, your suggestion of "if .... in eom_clusters" is much easier to read than my version, so I would actually prefer that line of code - although it might be a little bit faster to check the length first because otherwise we search the whole list of clusters for the root node every time we use the epsilon parameter, even though that is a rare case. But I'm not sure how much of a performance difference that makes... (maybe checking the length takes just as long anyway?) I'm fine with either way.

neontty · 2020-07-07T04:04:24Z

Thanks for clarifying! I think maybe as a middle ground I'll leave it as is for performance and put a comment above.

…t node in leaf list

fixes #370 -- epsilon search would crash when root node in leaf list

jhwang7628 · 2021-09-11T00:22:48Z

I ran into this same issue using the released 0.8.26 (pip installed). Any plan to release another version with the fixes? The release hasn't been updated for a while now.

nishi999 · 2023-04-05T14:06:47Z

I am experiencing the problem with allow_single_cluster=True, it suppose to return single cluster but I am always getting 3 clusters, -1 for noise , 0 for cluster1 and 1 cluster 2, using hdbscan 0.8.29. Any idea why this is happening?

neontty mentioned this issue Apr 27, 2020

fix issue 370 #371

Merged

neontty added a commit to neontty/hdbscan that referenced this issue May 26, 2020

var undefined in !allow_single_cluster codepath

9c6aa07

fixes scikit-learn-contrib#370 original fix for 370 was not fully tested

neontty mentioned this issue May 28, 2020

min_cluster_selection_eps not always respected #383

Closed

neontty added a commit to neontty/hdbscan that referenced this issue May 29, 2020

var undefined in !allow_single_cluster codepath

98280c1

fixes scikit-learn-contrib#370 original fix for 370 was not fully tested

neontty mentioned this issue May 29, 2020

var undefined in !allow_single_cluster codepath #384

Closed

neontty mentioned this issue May 29, 2020

Issue 370 #385

Merged

lmcinnes closed this as completed in #385 May 30, 2020

neontty added a commit to neontty/hdbscan that referenced this issue Jul 4, 2020

fixes scikit-learn-contrib#370 -- epsilon search would crash when roo…

bd65826

…t node in leaf list

neontty mentioned this issue Jul 4, 2020

fixes #370 -- epsilon search would crash when root node in leaf list #400

Merged

neontty added a commit to neontty/hdbscan that referenced this issue Oct 9, 2020

fixes scikit-learn-contrib#370 -- epsilon search would crash when roo…

c41f641

…t node in leaf list

neontty mentioned this issue Oct 9, 2020

cluster_selection_epsilon does not merge noise at root level #417

Closed

neontty added a commit that referenced this issue Oct 10, 2020

Merge pull request #400 from neontty/issue-370.2

8f05718

fixes #370 -- epsilon search would crash when root node in leaf list

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Crash when allow_single_cluster used with cluster_selection_epsilon #370

Crash when allow_single_cluster used with cluster_selection_epsilon #370

danielzgtg commented Apr 21, 2020

neontty commented Apr 24, 2020 •

edited

neontty commented Apr 24, 2020

neontty commented May 29, 2020

cmalzer commented May 29, 2020 •

edited

neontty commented May 29, 2020

neontty commented Jun 23, 2020

cmalzer commented Jun 23, 2020

neontty commented Jun 25, 2020

cmalzer commented Jun 25, 2020 •

edited

neontty commented Jun 25, 2020

neontty commented Jul 4, 2020

cmalzer commented Jul 5, 2020 •

edited

neontty commented Jul 7, 2020

jhwang7628 commented Sep 11, 2021

nishi999 commented Apr 5, 2023 •

edited

Crash when allow_single_cluster used with cluster_selection_epsilon #370

Crash when allow_single_cluster used with cluster_selection_epsilon #370

Comments

danielzgtg commented Apr 21, 2020

Code

Expected behavior

Actual behavior

neontty commented Apr 24, 2020 • edited

neontty commented Apr 24, 2020

neontty commented May 29, 2020

cmalzer commented May 29, 2020 • edited

neontty commented May 29, 2020

neontty commented Jun 23, 2020

cmalzer commented Jun 23, 2020

neontty commented Jun 25, 2020

cmalzer commented Jun 25, 2020 • edited

neontty commented Jun 25, 2020

neontty commented Jul 4, 2020

cmalzer commented Jul 5, 2020 • edited

neontty commented Jul 7, 2020

jhwang7628 commented Sep 11, 2021

nishi999 commented Apr 5, 2023 • edited

neontty commented Apr 24, 2020 •

edited

cmalzer commented May 29, 2020 •

edited

cmalzer commented Jun 25, 2020 •

edited

cmalzer commented Jul 5, 2020 •

edited

nishi999 commented Apr 5, 2023 •

edited