Can I force approximate_predict to assign every embedding to an existing cluster? #599

mirix · 2023-07-05T12:16:53Z

Hello,

Let me see if I am understanding things correctly.

I am reducing dimensionality with UMAP:

		clusterable_embedding_large = umap.UMAP(
		    n_neighbors=n_neighbors,
		    min_dist=.0,
		    n_components=comp,
		    random_state=31416,
		    metric='cosine'
		).fit_transform(df_dist)

Then I split the UMAP embeddings into long and short sentences using predefined indexes:

		cel_long = clusterable_embedding_large[long_seg]
		cel_shor = clusterable_embedding_large[shor_seg]

Then I cluster the long sentences only:

		clusterer = hdbscan.HDBSCAN(
		    min_samples=1,
		    min_cluster_size=cluster_size,
		    #cluster_selection_method='eom',
		    cluster_selection_method='leaf',
		    cluster_selection_epsilon=5,
		    gen_min_span_tree=True,
		    prediction_data=True
		).fit(cel_long)

Next I would like to assign each of the short sentences to one of the pre-existing clusters:

		labels = list(clusterer.labels_)
		labels_short, strengths = hdbscan.approximate_predict(clusterer, cel_shor)
		labels_short = list(labels_short)
		
		print(labels)
		# [0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1]
		print(labels_short)
		# [1, -1, 0, 1, -1, -1, 0, -1, 0, -1, -1, 1, -1, 0, 1, 0, -1, 0, -1, -1, 0, -1, 2, 0, 0, 0, -1, 0, 0, -1, 0, 0, 0, 0, -1, -1, 0, 0, -1, -1, -1, -1]

However, I face two issues:

1. Some points are not assigned (label -1).
2. Some points are assigned to a new cluster which did not exist in the original clustering (label 2).

The first issue I believe I understand, but I would like to avoid it, if possible. Is it possible to force approximate_predict to assign a data point to the nearest cluster no matter what?

On the other hand, I believed that the second issue was not possible. From the docs:

With that done you can run [approximate_predict()](https://hdbscan.readthedocs.io/en/latest/api.html#hdbscan.prediction.approximate_predict) with the model and any new data points you wish to predict. Note that this differs from re-running HDBSCAN with the new points added since no new clusters will be considered – instead the new points will be labelled according to the clusters already labelled by the model.

Can this also be avoided?

Best,

Ed


lmcinnes · 2023-07-05T13:40:34Z

I think you want to try the soft clustering options to do that.


mirix · 2023-07-05T14:20:35Z

Thanks, it seems promising. I will look into that.

In the meantime, I have found a workaround:

I cluster all the points together as usual. Then, for each short sentence, I compute the average distance from each cluster (excluding short sentences) and reassign if required.

This seems to solve the problem on the current dataset.
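The workaround described above could be sketched roughly as follows. This is an illustration, not the code actually used. The function name `reassign_to_nearest_cluster`, the use of Euclidean distance, and the choice to relabel every short point (rather than only "if required", whose exact criterion is not stated) are all assumptions.

```python
import numpy as np

def reassign_to_nearest_cluster(points, labels, is_short):
    """Relabel each short point with the cluster whose long members are closest on average.

    Hypothetical helper: Euclidean distance and unconditional reassignment are assumptions.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels).copy()
    is_short = np.asarray(is_short, dtype=bool)

    # Reference members: long points that received a real cluster label (not noise).
    ref_mask = ~is_short & (labels >= 0)
    cluster_ids = np.unique(labels[ref_mask])
    if cluster_ids.size == 0:
        raise ValueError("no labelled long points to compare against")

    short_idx = np.flatnonzero(is_short)
    # mean_dist[i, j] = mean distance from short point i to the long members of cluster j
    mean_dist = np.empty((short_idx.size, cluster_ids.size))
    for j, cid in enumerate(cluster_ids):
        members = points[ref_mask & (labels == cid)]
        diffs = points[short_idx, None, :] - members[None, :, :]
        mean_dist[:, j] = np.linalg.norm(diffs, axis=2).mean(axis=1)

    labels[short_idx] = cluster_ids[mean_dist.argmin(axis=1)]
    return labels

# Toy data: four long points in two clusters, two short points with poor initial labels.
points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.2, 0.1], [4.8, 5.2]])
labels = np.array([0, 0, 1, 1, -1, 0])
is_short = np.array([False, False, False, False, True, True])
new_labels = reassign_to_nearest_cluster(points, labels, is_short)
print(new_labels.tolist())  # [0, 0, 1, 1, 0, 1]
```

Only long points count as cluster members when computing the averages, so a short point that was initially mislabelled cannot pull other short points toward the wrong cluster.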

mirix · 2023-07-06T07:07:54Z

In case you are interested, HDBSCAN works wonderfully for clustering speakers in a diarisation project:

https://github.com/mirix/approaches-to-diarisation

I am really impressed. The challenge now would be to come up with some heuristics or ML to guess the optimal parameters automatically.
