Idea: Cohere Wikipedia Dataset #393

Open
mmmaia opened this issue Apr 21, 2023 · 3 comments
@mmmaia

mmmaia commented Apr 21, 2023

I believe Cohere's recently released Wikipedia Embedding Archives could be a good addition to the benchmark datasets.

The multilingual nature of the dataset is worth noting.


| Wikipedia | Number of vectors / embedded passages |
| --- | --- |
| English | 35 million |
| German | 15 million |
| French | 13 million |
| Spanish | 10 million |
| Italian | 8 million |
| Japanese | 5 million |
| Arabic | 3 million |
| Chinese (Simplified) | 2 million |
| Korean | 1 million |
| Simple English | 486 thousand |
| Hindi | 432 thousand |
| Total | 94 million |

@erikbern
Owner

Good idea! Do you want to add it to https://github.com/erikbern/ann-benchmarks/blob/main/ann_benchmarks/datasets.py (for English)? I'm about to run a new round of benchmarks so we could include that as one dataset.
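
For anyone picking this up, here is a rough sketch of the kind of preprocessing a new benchmark dataset typically needs: split the vectors into train and test sets, then compute exact nearest neighbors for each test vector as ground truth. It is not the code in `datasets.py`. The helper name `prepare_split`, the choice of angular distance, and the tiny synthetic sample are all assumptions for illustration. A real version would load the Cohere embeddings and would probably use NumPy rather than plain Python.

```python
import heapq
import math
import random


def normalize(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def prepare_split(embeddings, test_size, k, seed=0):
    """Hypothetical helper: split vectors into train/test and compute the exact
    top-k neighbors (by angular distance) of each test vector within train."""
    rng = random.Random(seed)
    vectors = [normalize(v) for v in embeddings]
    rng.shuffle(vectors)
    test, train = vectors[:test_size], vectors[test_size:]

    neighbors, distances = [], []
    for query in test:
        # For unit vectors, angular (cosine) distance is 1 - dot product.
        scored = (
            (1.0 - sum(a * b for a, b in zip(query, t)), i)
            for i, t in enumerate(train)
        )
        top = heapq.nsmallest(k, scored)  # brute force, sorted by distance
        neighbors.append([i for _, i in top])
        distances.append([d for d, _ in top])
    return train, test, neighbors, distances


# Tiny synthetic stand-in for the real embeddings (assumption: 8 dimensions).
sample_rng = random.Random(42)
sample = [[sample_rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
train, test, neighbors, distances = prepare_split(sample, test_size=20, k=5)
```

Brute force is only practical on a sample. For the full 35 million English vectors, the ground truth would need a batched or subsampled computation.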

@mmmaia
Author

mmmaia commented Apr 21, 2023

I'm pretty new to this, so it would probably take me some time to get it working 😬

I may give it a try next week if nobody else does it first.

@erikbern
Owner

Ok no rush, I can also take a look at it. But you're very welcome to look at it too, if I don't have time to!
