fix: Convert Reddit cluster s2s and p2p to fast #729
When subsampling for the p2p dataset, a few labels have only 1 sample, so we get the following error:
Should we exclude them before subsampling, or force-include them somehow?
Clusters with only one label?
Yes. You should be able to see this error when you check out the PR and run
Hmm, that seems weird. I assume this was already the case before the conversion as well? It might be that the cluster sets do not have the same labels across the sets.
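One way to check the hypothesis above is to compare the label set of each cluster set against the union of all labels. This is only a minimal sketch: the function name `labels_missing_per_set` and the toy data are hypothetical, and it assumes each cluster set is available as a plain list of labels.

```python
def labels_missing_per_set(cluster_sets):
    """For each cluster set, return the labels that occur in some other set but not in it."""
    # Union of every label seen in any cluster set.
    all_labels = set().union(*cluster_sets) if cluster_sets else set()
    return [all_labels - set(labels) for labels in cluster_sets]


# Hypothetical toy data: each inner list holds the labels of one cluster set.
cluster_sets = [
    ["python", "rust", "python"],
    ["python", "golang"],
]
missing = labels_missing_per_set(cluster_sets)
```

If every entry of `missing` is an empty set, all cluster sets share the same labels; a non-empty entry points at labels that are absent from that set.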
Yeah. But before the conversion, we did not use subsampling. The following labels have only 1 sample in the clusters after combining all rows:
We should probably filter out all labels with fewer than N samples. I assume N=1-5 is reasonable?
Went with N=1.
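The filtering step discussed above could be sketched roughly as follows. This is not the code from the PR: `drop_rare_labels` and `min_count` are hypothetical names, and the sketch assumes the combined rows are available as two parallel lists. It also assumes that "N=1" means dropping labels that occur only once, which corresponds to `min_count=2` here, since stratified sampling generally needs at least two samples per label.

```python
from collections import Counter


def drop_rare_labels(sentences, labels, min_count=2):
    """Keep only samples whose label occurs at least `min_count` times."""
    counts = Counter(labels)
    kept = [(s, l) for s, l in zip(sentences, labels) if counts[l] >= min_count]
    kept_sentences = [s for s, _ in kept]
    kept_labels = [l for _, l in kept]
    return kept_sentences, kept_labels


# Hypothetical toy data: label "y" has a single sample and is filtered out.
sentences = ["a", "b", "c", "d", "e"]
labels = ["x", "x", "y", "z", "z"]
filtered_sentences, filtered_labels = drop_rare_labels(sentences, labels)
```

Applying a filter like this before the subsampling call means every remaining label has enough samples to be split.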
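Checklist for adding MMTEB dataset
Resolve #728
- Dataset tested with the `mteb` package.
- Models run on the task using the `mteb run -m {model_name} -t {task_name}` command:
  - `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
  - `intfloat/multilingual-e5-small`
- Subsampling via `self.stratified_subsampling()` under `dataset_transform()`.
- Tests run using `make test`.
- Formatter run using `make lint`.
- Points file added (e.g. `438.jsonl`).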