Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

For document clustering, should we leave instruction blank? #6

Open
griff4692 opened this issue Feb 23, 2024 · 3 comments
Open

For document clustering, should we leave instruction blank? #6

griff4692 opened this issue Feb 23, 2024 · 3 comments

Comments

@griff4692
Copy link

Thanks -- I am using Grit for document embeddings that will be used to score doc-to-doc similarity.

Should I add an instruction or leave it blank?

Thank you,
Griffin

@Muennighoff
Copy link
Contributor

Thanks -- I am using Grit for document embeddings that will be used to score doc-to-doc similarity.

Should I add an instruction or leave it blank?

Thank you, Griffin

So is it like STS rather than Retrieval? I would probably add them in that case, but it may make sense to try both.

@griff4692
Copy link
Author

Thanks -- I am using Grit for document embeddings that will be used to score doc-to-doc similarity.
Should I add an instruction or leave it blank?
Thank you, Griffin

So is it like STS rather than Retrieval? I would probably add them in that case, but it may make sense to try both.

Thanks for the reply! Yes - in order to cluster documents for in-context pre-training (https://arxiv.org/abs/2310.10638).

Was going to try "Identify the main topics from a medical document." but wasn't sure how instructions for embeddings are meant to be worded for gritlm.

@Muennighoff
Copy link
Contributor

Thanks -- I am using Grit for document embeddings that will be used to score doc-to-doc similarity.
Should I add an instruction or leave it blank?
Thank you, Griffin

So is it like STS rather than Retrieval? I would probably add them in that case, but it may make sense to try both.

Thanks for the reply! Yes - in order to cluster documents for in-context pre-training (https://arxiv.org/abs/2310.10638).

Was going to try "Identify the main topics from a medical document." but wasn't sure how instructions for embeddings are meant to be worded for gritlm.

Yeah I think for clustering you'll get slightly better performance if you include an instruction. The one you proposed sounds good to me!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants