
Inquiry on Utilizing UMAP for Text Similarity and Clustering #1113

dsdanielpark opened this issue Apr 17, 2024 · 4 comments
@dsdanielpark

Hello,

I would like to express my sincere appreciation for your responsive communication and careful maintenance of this package. I have reviewed the documentation and code related to UMAP, but I still find myself in need of expert advice.

My intention is to use UMAP for clustering and for measuring similarity between arrays of sentence embeddings. The data has no labels, and I have several questions about this process. I would also like to reason about how my text-similarity results compare with the outcomes UMAP produces.

  1. Can UMAP guarantee better results than DBSCAN?
  2. In the absence of labels, do you have any advice on efficiently choosing the output dimensionality and distance-related parameters (args) in UMAP?

Any keywords, references, or preliminary answers you could provide would be greatly appreciated.

Thank you once again for your wonderful project.

@lmcinnes
Owner

Thank you for the kind words.

Mostly what UMAP will buy you over using DBSCAN directly on the embedding vectors is a lot more of your data clustered while still having reasonably fine-grained clusters. Can I guarantee better results? I think there are no guarantees, especially in unsupervised learning. Would I expect better results if you use UMAP first and then DBSCAN or HDBSCAN? Yes, I definitely would.

Choosing parameters is always going to come down to the data you have, the kinds of results you want to get, and what you are going to use the clustering for from there. Some rules of thumb:

  - n_components=5 is a good starting point for clustering. It is enough dimensions that UMAP has a much easier time resolving tangles etc. in the optimization, but still pretty low. I would not choose n_components larger than n_neighbors (or really larger than 20, even if you have a very large n_neighbors).
  - The choice of n_neighbors strongly influences the granularity of the clustering: the smaller the value, the more fine-grained the clusters you'll tend to get out (assuming DBSCAN or HDBSCAN for clustering the UMAP output).
  - As for metric, the usual choice for sentence embeddings is "cosine". If you want to try something a little different, import pynndescent and use pynndescent.distances.alternative_cosine, a small tweak on cosine distance that may work better for your use case with UMAP.
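For reference, a minimal sketch of that UMAP-then-HDBSCAN pipeline with these starting values (not from the thread itself; it assumes the hdbscan package is installed, and the random array here is just a placeholder for real sentence embeddings, with min_cluster_size an illustrative value to tune):

```python
import numpy as np
import umap
import hdbscan

# Placeholder for real sentence embeddings, shape (n_sentences, embedding_dim)
embeddings = np.random.rand(2000, 384).astype(np.float32)

reducer = umap.UMAP(
    n_neighbors=15,      # smaller values give finer-grained clusters downstream
    n_components=5,      # suggested starting point for clustering
    metric="cosine",     # usual choice for sentence embeddings
    random_state=42,
)
low_dim = reducer.fit_transform(embeddings)

# Cluster the reduced vectors rather than the raw embeddings
clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
labels = clusterer.fit_predict(low_dim)  # label -1 marks points left as noise
```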

@dsdanielpark
Author

Thank you for your kind response. I'll start as you suggested!

Can UMAP be updated in batches? Is it possible to fit a UMAP model on a large image dataset and then continue training it with new data? It seems impossible given UMAP's mechanics, but I wonder whether implementing this feature would be difficult.

@lmcinnes
Owner

I think for that use case you might want to look into ParametricUMAP. UMAP does have an update method, but it is definitely not the same as training on the full dataset.
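For context, a minimal ParametricUMAP sketch (assuming TensorFlow is installed, since ParametricUMAP trains a Keras encoder; the batches below are random placeholder data standing in for real feature vectors):

```python
import numpy as np
from umap.parametric_umap import ParametricUMAP

# Placeholder batches standing in for real feature vectors
initial_batch = np.random.rand(5000, 128).astype(np.float32)
later_batch = np.random.rand(1000, 128).astype(np.float32)

# Fit the neural-network encoder on an initial sample
embedder = ParametricUMAP(n_components=5, metric="cosine")
initial_embedding = embedder.fit_transform(initial_batch)

# The trained encoder can then map later batches into the same space
# without refitting from scratch
new_embedding = embedder.transform(later_batch)
```

As noted above, plain umap.UMAP also has an update method for adding new data, but it is not equivalent to training on the full dataset.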

@dsdanielpark
Author

Thank you for your response! I will check out ParametricUMAP!
