New Benchmarking Dataset Enquiry #440

anubhav562 · 2023-07-10T12:33:16Z

Hey @erikbern and ANN Benchmarking team!

Thanks for creating and maintaining a clean and sophisticated repository for the community.

I am Anubhav, an NLP Researcher at Advanced Symbolics Inc. My company and I work on probabilistic sampling of social media data and on mining people's opinions from it (https://www.advancedsymbolics.com).

I recently went through all 13 benchmarking datasets included in the repository and found that, although they span a variety of categories (Computer Vision, Recommender Systems, Movie Ratings, etc.), the NLP domain is represented only by GloVe. GloVe embeddings are fairly old, and newer, more sophisticated text embeddings are now in use in the industry.

I would like to ask whether you would allow me to contribute a benchmarking dataset of tweets, ranging from 1M to 10M in size. These tweets would be probabilistically sampled and would represent a real-world text/NLP dataset. The test set would contain a set of "n" queries, which would be search engine queries from users (e.g. "Top SUVs under $50K?"). We would then use the latest language models to convert all of the data to embeddings. Finally, we can run the experiments for all the indices on this dataset as well.
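
To make the plan above concrete, here is a minimal sketch of the evaluation side: exact (brute-force) nearest neighbours for each query under angular distance, plus a recall measure against that ground truth. Everything in it is an assumption for illustration. The random unit vectors stand in for real tweet and query embeddings, the sizes are tiny, and the helper names (`make_placeholder_embeddings`, `exact_neighbors`, `recall_at_k`) are hypothetical and not part of the repository.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def angular_distance(a, b):
    """1 - cosine similarity; assumes both inputs are already unit vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def exact_neighbors(train, query, k):
    """Indices of the k training vectors closest to the query (brute force)."""
    order = sorted(range(len(train)),
                   key=lambda i: angular_distance(train[i], query))
    return order[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbours that the approximate result found."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


def make_placeholder_embeddings(n, dim, seed):
    """Random unit vectors; a stand-in for real language-model embeddings."""
    rng = random.Random(seed)
    return [normalize([rng.gauss(0, 1) for _ in range(dim)])
            for _ in range(n)]


# Assumed toy sizes; the proposal targets 1M to 10M tweets.
train = make_placeholder_embeddings(200, 16, seed=0)   # "tweet" vectors
queries = make_placeholder_embeddings(5, 16, seed=1)   # "query" vectors
ground_truth = [exact_neighbors(train, q, k=10) for q in queries]
```

An approximate index would then be scored by comparing the ids it returns for each query with the matching entry of `ground_truth` via `recall_at_k`. At the proposed scale the brute-force step would need a vectorised or batched implementation rather than pure Python.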

Why should we consider adding this dataset?
There has been a recent boom in LLMs and vector databases, and many people refer to ANN-Benchmarks when working with them. We also referred to ANN-Benchmarks to decide which index to choose. Unfortunately, the repo did not include an up-to-date NLP use case.

Please let me know if this plan sounds good to you!

Thanks and Regards,
Anubhav Chhabra

anubhav562 · 2023-07-11T14:51:07Z

Hey @erikbern @maumueller @ale-f, any thoughts on this? If you would accept the contribution, I would start working on it right away!

maumueller · 2023-07-17T13:40:11Z

Hi @anubhav562. Could you please provide some results on the suggested dataset? I like the application, but I think it should add some diversity in terms of the performance that we see.

anubhav562 · 2023-07-17T15:59:08Z

Makes sense! I will work that out and post the results here. Thanks for your input @maumueller!
