Converted PLSC to hierarchical #704
The later levels seem very hard. Maybe we should limit the levels to two?
I'm not sure whether the way I formulated the task makes sense.
Looks good
0.09242363263566819,
0.08387202889701235
],
"Level 3": [
I would cut from here and down
Disciplines are multilabel, but for the added clustering tasks I chose only those cases where there is exactly one discipline.
Yes, "scientific_fields" could be used as the first level and "disciplines" as the second. The entire dataset is available at https://huggingface.co/datasets/rafalposwiata/plsc
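A minimal sketch of that two-level labelling, on made-up rows. The field names "scientific_fields" and "disciplines" come from the dataset; the example rows, the helper name `to_two_level_labels`, and the assumption that both fields are lists of strings are hypothetical. It also requires exactly one scientific field per row, which is an extra assumption on top of the single-discipline filter described above.

```python
from typing import Dict, List

# Hypothetical rows for illustration; the real dataset schema may differ.
records: List[Dict] = [
    {"title": "A", "scientific_fields": ["natural sciences"], "disciplines": ["physics"]},
    {"title": "B", "scientific_fields": ["natural sciences"], "disciplines": ["physics", "chemistry"]},
    {"title": "C", "scientific_fields": ["humanities"], "disciplines": ["history"]},
]


def to_two_level_labels(rows: List[Dict]) -> List[Dict]:
    """Keep rows with exactly one field and one discipline.

    Each kept row gets a label list [field, discipline], so position 0
    is the first level and position 1 is the second level.
    """
    labelled = []
    for row in rows:
        fields = row["scientific_fields"]
        disciplines = row["disciplines"]
        # Skip multilabel rows: they have no single path through the levels.
        if len(fields) == 1 and len(disciplines) == 1:
            labelled.append({"title": row["title"], "labels": [fields[0], disciplines[0]]})
    return labelled


labelled = to_two_level_labels(records)
```

Row "B" is dropped here because it carries two disciplines; whether dropping such rows is acceptable depends on how many of them the dataset contains.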
@x-tabdeveloping, will you add points for this? Then I believe it is ready to merge.
I'm not sure though. The task formulation might be wrong. I think doing "scientific_fields" as first level and "disciplines" as the second might be the way to go.
@x-tabdeveloping but the current approach is fine with that, right? As I understand it, it just does the clustering at each level?
Yes, unless the order is not fixed, and I don't know if it is (we have to check).
Right. Once checked, we can either close or merge.
Nope, it's not hierarchical at all. We can maybe rephrase it as multilabel classification if we really want to, otherwise fine to leave it as flat clustering.
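For reference, one way such a check could be sketched, on made-up pairs: a strict two-level hierarchy would need every discipline to sit under exactly one scientific field. This is only one reading of "hierarchical" and not necessarily the check that was actually run; the helper name `disciplines_with_multiple_fields` and the example pairs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def disciplines_with_multiple_fields(
    pairs: Iterable[Tuple[str, str]],
) -> Dict[str, List[str]]:
    """Return disciplines that appear under more than one scientific field.

    `pairs` holds (scientific_field, discipline) tuples. An empty result
    means the pairs are consistent with a strict two-level hierarchy.
    """
    parents = defaultdict(set)
    for field, discipline in pairs:
        parents[discipline].add(field)
    return {d: sorted(fields) for d, fields in parents.items() if len(fields) > 1}


# Hypothetical pairs for illustration only.
pairs = [
    ("natural sciences", "physics"),
    ("engineering", "physics"),
    ("humanities", "history"),
]
conflicts = disciplines_with_multiple_fields(pairs)
```

Any non-empty `conflicts` mapping would point to labels that do not nest cleanly, which would support treating the task as flat clustering or multilabel classification.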
Let us leave it as flat clustering.
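Checklist for adding MMTEB dataset
Reason for dataset addition: Converted both PLSC tasks (S2S, P2P) to hierarchical clustering. #702
- Test that the dataset runs with the `mteb` package.
- Run the following models on the task with the `mteb run -m {model_name} -t {task_name}` command:
  - `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
  - `intfloat/multilingual-e5-small`
- If the dataset is too big, consider using `self.stratified_subsampling()` under `dataset_transform()`.
- Run tests locally with `make test`.
- Run the formatter with `make lint`.
- Add points for the submission, using the PR number as the filename (e.g. `438.jsonl`).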