Can I force approximate_predict to assign every embedding to an existing cluster? #599

Open
mirix opened this issue Jul 5, 2023 · 3 comments


mirix commented Jul 5, 2023

Hello,

Let me see if I am understanding things correctly.

I am reducing dimensionality with UMAP:

		import umap

		# n_neighbors, comp and df_dist are defined earlier in my script
		clusterable_embedding_large = umap.UMAP(
		    n_neighbors=n_neighbors,
		    min_dist=0.0,
		    n_components=comp,
		    random_state=31416,
		    metric='cosine'
		).fit_transform(df_dist)

Then I split the UMAP embeddings into long and short sentences using predefined indexes:

		cel_long = clusterable_embedding_large[long_seg]
		cel_shor = clusterable_embedding_large[shor_seg]

Then I cluster the long sentences only:

		import hdbscan

		# cluster_size is defined earlier in my script
		clusterer = hdbscan.HDBSCAN(
		    min_samples=1,
		    min_cluster_size=cluster_size,
		    #cluster_selection_method='eom',
		    cluster_selection_method='leaf',
		    cluster_selection_epsilon=5,
		    gen_min_span_tree=True,
		    prediction_data=True
		).fit(cel_long)

Next I would like to assign each of the short sentences to one of the pre-existing clusters:

		labels = list(clusterer.labels_)
		labels_short, strengths = hdbscan.approximate_predict(clusterer, cel_shor)
		labels_short = list(labels_short)

		print(labels)
		# [0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1]
		print(labels_short)
		# [1, -1, 0, 1, -1, -1, 0, -1, 0, -1, -1, 1, -1, 0, 1, 0, -1, 0, -1, -1, 0, -1, 2, 0, 0, 0, -1, 0, 0, -1, 0, 0, 0, 0, -1, -1, 0, 0, -1, -1, -1, -1]

However, I face two issues:

  1. Some points are not assigned (label -1).

  2. Some points are assigned to a new cluster which did not exist in the original clustering (label 2).

The first issue I believe I understand, but I would like to avoid it, if possible. Is it possible to force approximate_predict to assign a data point to the nearest cluster no matter what?

On the other hand, I believed that the second issue should not be possible. From the docs:

With that done you can run [approximate_predict()](https://hdbscan.readthedocs.io/en/latest/api.html#hdbscan.prediction.approximate_predict) with the model and any new data points you wish to predict. Note that this differs from re-running HDBSCAN with the new points added since no new clusters will be considered – instead the new points will be labelled according to the clusters already labelled by the model.

Can this also be avoided?
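To make the request concrete, here is a minimal sketch of what forcing an assignment could look like as a post-processing step. It is not a feature of approximate_predict: the helper name `force_assign` is hypothetical, and using Euclidean distance in the UMAP space to pick the nearest clustered training point is an assumption. Any predicted label that is -1, or that does not occur among the fitted labels (such as the 2 above), is replaced.

```python
import numpy as np


def force_assign(train_points, train_labels, new_points, new_labels):
    """Give every new point a label that exists in the fitted clustering.

    Hypothetical helper: a new point whose predicted label is -1, or is not
    among the fitted labels, takes the label of its nearest clustered
    training point (Euclidean distance, which is an assumption).
    """
    train_points = np.asarray(train_points, dtype=float)
    train_labels = np.asarray(train_labels)
    new_points = np.asarray(new_points, dtype=float)
    fixed = np.asarray(new_labels).copy()

    # Training points labelled as noise cannot donate a label.
    clustered = train_labels >= 0
    if not clustered.any():
        raise ValueError("the fitted clustering has no non-noise points")
    valid = np.unique(train_labels[clustered])

    # Points to fix: noise (-1) and labels unseen during fitting.
    todo = ~np.isin(fixed, valid)
    if todo.any():
        diffs = new_points[todo][:, None, :] - train_points[clustered][None, :, :]
        nearest = np.linalg.norm(diffs, axis=2).argmin(axis=1)
        fixed[todo] = train_labels[clustered][nearest]
    return fixed
```

With the variables from this post, the call would be `force_assign(cel_long, clusterer.labels_, cel_shor, labels_short)`. Note that this ignores the `strengths` returned by approximate_predict, so a point far from every cluster is still assigned somewhere.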

Best,

Ed

Collaborator

lmcinnes commented Jul 5, 2023 via email

Author

mirix commented Jul 5, 2023

Thanks, it seems promising. I will look into that.

In the meantime, I have found a workaround:

I cluster all the points together as usual. Then, for each short sentence, I compute its average distance to each cluster (counting only the long sentences as cluster members) and reassign it if required.

This seems to solve the problem on the current dataset.
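A minimal sketch of that reassignment step, under some assumptions: the helper name `reassign_by_mean_distance` and the boolean mask `is_short` are hypothetical, distances are Euclidean in the UMAP space, and "reassign if required" is read as "move each short sentence to the cluster with the smallest average distance".

```python
import numpy as np


def reassign_by_mean_distance(points, labels, is_short):
    """Relabel short items by their mean distance to each cluster's long items.

    Hypothetical helper: `points` holds all embeddings, `labels` the labels
    from clustering everything together, and `is_short` marks short items.
    """
    points = np.asarray(points, dtype=float)
    new_labels = np.asarray(labels).copy()
    is_short = np.asarray(is_short, dtype=bool)

    # Reference members: long items that ended up in a real cluster.
    reference = ~is_short & (new_labels >= 0)
    clusters = np.unique(new_labels[reference])
    if clusters.size == 0:
        raise ValueError("no long item belongs to a cluster")

    short_idx = np.flatnonzero(is_short)
    # mean_dist[i, j]: mean distance from short item i to the long
    # members of cluster j (Euclidean, an assumption).
    mean_dist = np.empty((short_idx.size, clusters.size))
    for j, c in enumerate(clusters):
        members = points[reference & (new_labels == c)]
        diffs = points[short_idx][:, None, :] - members[None, :, :]
        mean_dist[:, j] = np.linalg.norm(diffs, axis=2).mean(axis=1)

    # Each short item takes the cluster with the smallest mean distance.
    new_labels[short_idx] = clusters[mean_dist.argmin(axis=1)]
    return new_labels
```

Because only clusters that contain at least one long sentence are candidates, short sentences can never end up as noise or in a cluster made up purely of short sentences.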

Author

mirix commented Jul 6, 2023

In case you are interested, HDBSCAN works wonderfully for clustering speakers in a diarisation project:

https://github.com/mirix/approaches-to-diarisation

I am really impressed. The challenge now would be to come up with some heuristics or ML to guess the optimal parameters automatically.
