Add SWIM-IR #617

Open
rasdani opened this issue May 2, 2024 · 5 comments

Comments

@rasdani
Contributor

rasdani commented May 2, 2024

Google released a new crosslingual retrieval dataset:
https://huggingface.co/datasets/nthakur/swim-ir-cross-lingual

We could turn a subset of this into a retrieval and reranking benchmark.

If no one picks this up, I can take a look at this during the weekend.
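
A minimal sketch of how a subset of query/passage pairs could be arranged for retrieval evaluation, assuming each record carries `query`, `title`, and `text` fields (an assumption, not checked against the actual dataset schema). The helper name `build_retrieval_subset` and the sample records are hypothetical, and the output layout (queries, corpus, relevance judgments) follows the convention commonly used by BEIR-style benchmarks rather than any confirmed MTEB requirement:

```python
import random


def build_retrieval_subset(pairs, num_queries, seed=42):
    """Sample query/passage pairs and arrange them as queries, corpus, and qrels.

    `pairs` is a list of dicts with assumed keys "query", "text", and
    optionally "title". Each sampled pair yields one query, one document,
    and one positive relevance judgment linking them.
    """
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    sampled = rng.sample(pairs, min(num_queries, len(pairs)))

    queries, corpus, qrels = {}, {}, {}
    for i, pair in enumerate(sampled):
        query_id, doc_id = f"q{i}", f"d{i}"
        queries[query_id] = pair["query"]
        corpus[doc_id] = {"title": pair.get("title", ""), "text": pair["text"]}
        qrels[query_id] = {doc_id: 1}  # the paired passage is the only positive
    return queries, corpus, qrels


# Hypothetical records for illustration only; not taken from the dataset.
pairs = [
    {"query": "Wer schrieb Faust?", "title": "Faust",
     "text": "Faust is a tragic play by Johann Wolfgang von Goethe."},
    {"query": "¿Cuál es la capital de Francia?", "title": "Paris",
     "text": "Paris is the capital and largest city of France."},
    {"query": "富士山の高さは?", "title": "Mount Fuji",
     "text": "Mount Fuji is the tallest mountain in Japan at 3,776 metres."},
]

queries, corpus, qrels = build_retrieval_subset(pairs, num_queries=2)
```

Note that this sketch treats every passage as a distinct document, so passages shared by several queries would need deduplication before being used as a real corpus.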

@isaac-chung
Collaborator

Amazing. Feel free to open a PR :)

@Muennighoff
Contributor

That'd be great indeed. cc @thakur-nandan

@thakur-nandan

Thanks @Muennighoff. The SWIM-IR dataset would be great, but it contains training splits only, as it is meant to be used for training. If that is desirable, we can go ahead and add it to MTEB.

Let me know if you need help @rasdani.

Thanks,
Nandan

@Muennighoff
Contributor

> Thanks @Muennighoff. The SWIM-IR dataset would be great and contains training splits only as it should be used for training. If that would be desirable we can go ahead and add it into MTEB.
>
> Let me know if you need help @rasdani.
>
> Thanks, Nandan

Oh, does it still make sense to use it for evaluation, or not at all? Not sure if adding a training dataset makes sense. cc @KennethEnevoldsen

@KennethEnevoldsen
Contributor

I wouldn't add a dataset intended for training unless we expect it to evaluate an aspect which we are currently not evaluating.
