Idea: Cohere Wikipedia Dataset #393

mmmaia · 2023-04-21T19:07:13Z

I believe Cohere's recently released Wikipedia Embedding Archives could be a good addition to the benchmark datasets.

The multilingual nature of the dataset is noteworthy.

Wikipedia	Number of vectors / embedded passages
English	35 million
German	15 million
French	13 million
Spanish	10 million
Italian	8 million
Japanese	5 million
Arabic	3 million
Chinese (Simplified)	2 million
Korean	1 million
Simple English	486 thousand
Hindi	432 thousand
Total	94 million

erikbern · 2023-04-21T19:13:44Z

Good idea! Do you want to add it to https://github.com/erikbern/ann-benchmarks/blob/main/ann_benchmarks/datasets.py (for English)? I'm about to run a new round of benchmarks so we could include that as one dataset.
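
For reference, a minimal sketch of what a new dataset entry generally has to produce: a train set, a held-out query (test) set, and brute-force ground-truth neighbors for each query. It uses plain Python on toy random vectors; it does not download the Cohere archives or call any helper from `datasets.py`. The function names, the angular distance metric, and the split sizes are all assumptions for illustration only.

```python
import heapq
import math
import random


def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def angular_distance(a, b):
    """1 - cosine similarity, assuming both inputs are already unit length."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def build_benchmark_split(vectors, test_size, k, seed=1):
    """Hypothetical helper: split vectors into train/test and compute
    exact top-k neighbors of every test vector within the train set."""
    rng = random.Random(seed)
    vectors = [normalize(v) for v in vectors]
    indices = list(range(len(vectors)))
    rng.shuffle(indices)
    test = [vectors[i] for i in indices[:test_size]]
    train = [vectors[i] for i in indices[test_size:]]

    neighbors, distances = [], []
    for query in test:
        # Brute force: score the query against every train vector.
        scored = [(angular_distance(query, v), i) for i, v in enumerate(train)]
        top = heapq.nsmallest(k, scored)  # sorted by ascending distance
        neighbors.append([i for _, i in top])
        distances.append([d for d, _ in top])
    return train, test, neighbors, distances


# Toy stand-in for real embeddings: 200 random 8-dimensional vectors.
toy_rng = random.Random(0)
toy_vectors = [[toy_rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
train, test, neighbors, distances = build_benchmark_split(
    toy_vectors, test_size=20, k=5
)
```

At 35 million vectors the brute-force step would need a vectorized or batched implementation; the loop above only shows the shape of the output.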

mmmaia · 2023-04-21T19:30:42Z

I'm pretty new to this, so it would probably take me some time to get it working 😬

I may give it a try next week, if nobody else does.

erikbern · 2023-04-21T21:43:52Z

Ok no rush, I can also take a look at it. But you're very welcome to look at it too, if I don't have time to!
