Fix prediction data not honoring cluster_selection_epsilon #586

Open. n9Mtq4 wants to merge 1 commit into base: master.

@n9Mtq4 n9Mtq4 commented Mar 22, 2023

This PR fixes prediction data ignoring cluster_selection_epsilon. The bug shows up as wrong predictions from approximate_predict and an incorrect exemplars_ list.

Code to reproduce the problem:

import hdbscan
from sklearn.datasets import make_blobs

blobs, _ = make_blobs(100, n_features=8, centers=10, random_state=42)

# Use a high epsilon to force fewer clusters; with real-world data this happens more easily.
clusterer = hdbscan.HDBSCAN(cluster_selection_epsilon=12.0, prediction_data=True)
clusterer.fit(blobs)

# 7 clusters from labels
clusterer.labels_.max() + 1
# 10 clusters from exemplars
len(clusterer.exemplars_)
# [5, 4, 3, 0, 5, 5, 6, 0, 5, 1]
clusterer.labels_[:10]
# predicting assigns points to completely different clusters (and a different number of clusters!)
# [6, 5, 4, 0, 6, 6, 9, 0, 6, 2]
hdbscan.approximate_predict(clusterer, blobs[:10])

I tracked the issue down to the prediction data selecting clusters from the tree differently from how it's done in _hdbscan_tree.pyx. The fix is to return the selected clusters from get_clusters in _hdbscan_tree.pyx and use those same clusters for prediction.

With this PR:

# 7 clusters from labels
clusterer.labels_.max() + 1
# 7 clusters from exemplars
len(clusterer.exemplars_)
# [5, 4, 3, 0, 5, 5, 6, 0, 5, 1]
clusterer.labels_[:10]
# predicting assigns points to the correct clusters
# [5, 4, 3, 0, 5, 5, 6, 0, 5, 1]
hdbscan.approximate_predict(clusterer, blobs[:10])
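
The before/after comparison above could be turned into a small regression check, roughly as sketched here. The helper name `prediction_data_is_consistent` is hypothetical and not part of hdbscan. It takes plain sequences, so it could be fed `clusterer.labels_`, `len(clusterer.exemplars_)`, and the first element of the tuple returned by `hdbscan.approximate_predict`. It assumes that re-predicting the training points should reproduce their fit labels, which holds for the example in this PR but is not guaranteed for every dataset.

```python
def prediction_data_is_consistent(labels, n_exemplar_sets, predicted):
    """Check that prediction data agrees with the fit result.

    labels: cluster labels from fitting (noise is -1)
    n_exemplar_sets: number of exemplar groups, e.g. len(clusterer.exemplars_)
    predicted: labels predicted for the same points, in the same order
    """
    labels = [int(x) for x in labels]
    predicted = [int(x) for x in predicted]
    # Number of clusters implied by the labels, as in labels_.max() + 1.
    n_clusters = max(labels, default=-1) + 1
    same_count = n_clusters == n_exemplar_sets
    same_labels = labels == predicted
    return same_count and same_labels


# Values copied from the outputs shown in this PR description.
fit_labels = [5, 4, 3, 0, 5, 5, 6, 0, 5, 1]
before = prediction_data_is_consistent(
    fit_labels, 10, [6, 5, 4, 0, 6, 6, 9, 0, 6, 2]
)
after = prediction_data_is_consistent(fit_labels, 7, fit_labels)
print(before, after)
```

In a real test the first argument should be the full `clusterer.labels_` array so the cluster count is computed over all points; the ten-point slice works here only because it already contains the highest label.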

This likely fixes #308

Successfully merging this pull request may close these issues.

exemplars_ ordering not matching labels_ ordering