Addition of : Thai, Romanian, Hebrew, Korean, Burmese, Nigerian (Multilingual) Datasets #724
Thanks for this addition! My comments are below.
Also, please rename files that contain underscores by removing the underscores (e.g., Thai_Restaurant_Reviews.json -> ThaiRestaurantReviews.json).
(In the interest of quick merging, I'd recommend not adding new datasets to the same PR after significant reviewing has been done.)
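The renaming requested above could be scripted rather than done by hand. The following is a minimal sketch, not part of the PR: the function name `strip_underscores` and the directory passed to it are hypothetical, and it only assumes that the files to rename share a common suffix such as `.json`.

```python
from pathlib import Path


def strip_underscores(directory: Path, suffix: str = ".json") -> list[tuple[str, str]]:
    """Rename files in `directory` by removing underscores from their names.

    Returns a list of (old_name, new_name) pairs for the files that changed.
    """
    renamed = []
    # sorted() materialises the listing before any file is renamed.
    for path in sorted(directory.glob(f"*{suffix}")):
        new_name = path.stem.replace("_", "") + path.suffix
        if new_name == path.name:
            continue  # no underscore in this file name
        target = path.with_name(new_name)
        if target.exists():
            # Refuse to overwrite an existing file.
            raise FileExistsError(f"{target} already exists")
        path.rename(target)
        renamed.append((path.name, new_name))
    return renamed
```

Calling `strip_underscores(Path("some/results/dir"))` on a directory containing Thai_Restaurant_Reviews.json would rename it to ThaiRestaurantReviews.json. Inside a git checkout, `git mv` may be preferable so that the rename is tracked.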
I've made all the changes, @KranthiGV @imenelydiaker.
I have also created a jsonl file and added points: 2 for @imenelydiaker and @KranthiGV for reviewing, and 12 for myself (6x2).
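For illustration, the point allocation described above could be produced as follows. This is a sketch under stated assumptions: the keys "GitHub", "New dataset" and "Review PR" are assumed to match the project's points format, the reviewers are assumed to receive 2 points each, and "contributor-handle" is a placeholder because the author's GitHub handle does not appear in this conversation.

```python
import json


def build_points_lines(author, n_datasets, reviewers,
                       points_per_dataset=2, review_points=2):
    """Build one JSON line per contributor (key names are assumptions)."""
    rows = [{"GitHub": author, "New dataset": n_datasets * points_per_dataset}]
    rows += [{"GitHub": name, "Review PR": review_points} for name in reviewers]
    return [json.dumps(row) for row in rows]


# 6 datasets x 2 points each for the author, plus 2 points per reviewer.
lines = build_points_lines(
    "contributor-handle",  # placeholder, not the real handle
    6,
    ["imenelydiaker", "KranthiGV"],
)
print("\n".join(lines))
```

Each returned string is one line of the jsonl file; writing them to disk is left to the caller.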
LGTM! Can you please add your name and affiliation here and run linting?
mteb/tasks/Classification/multilingual/IndicSentimentClassification.py
…ation.py Co-authored-by: Imene Kerboua <33312980+imenelydiaker@users.noreply.github.com>
I've done it.
LGTM, thank you for your contribution! Let's merge 🙂
Checklist for adding MMTEB dataset
Addition of :
Reason for dataset addition:
All of the datasets (except Korean) are for low-resource languages, which until now were represented mostly within multilingual datasets. Adding monolingual datasets for low-resource languages will enrich the diversity of the benchmark.
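I have tested that the dataset runs with the mteb package.
I have run the following models on the task, using the mteb run -m {model_name} -t {task_name} command:
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
intfloat/multilingual-e5-small
If the dataset is too big, consider using self.stratified_subsampling() under dataset_transform().
Run tests locally using make test.
Run the formatter using make lint.
I have added points for my submission, using the PR number as the filename (e.g., 438.jsonl).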
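The checklist above mentions stratified subsampling for large datasets. The sketch below illustrates the general idea in plain Python; it is not the mteb implementation of self.stratified_subsampling(), and the function name, parameters, and seed are assumptions chosen for the example.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, n_samples, seed=42):
    """Return sorted indices of roughly `n_samples` items, keeping label proportions."""
    if n_samples >= len(labels):
        return list(range(len(labels)))  # nothing to subsample

    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible

    # Group example indices by their class label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    chosen = []
    for indices in by_label.values():
        # Proportional share for this class, keeping at least one example.
        k = max(1, round(n_samples * len(indices) / len(labels)))
        chosen.extend(rng.sample(indices, min(k, len(indices))))
    return sorted(chosen)
```

Because each class share is rounded independently, the total can differ slightly from `n_samples`. The point of stratifying is that the class balance of the reduced split stays close to that of the full split, so scores on the smaller split remain comparable.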