New Benchmarking Dataset Enquiry #440

Open
anubhav562 opened this issue Jul 10, 2023 · 3 comments

Comments

@anubhav562

Hey @erikbern and ANN Benchmarking team!

Thanks for creating and maintaining a clean and sophisticated repository for the community.

I am Anubhav, an NLP Researcher at Advanced Symbolics Inc. My company and I work on probabilistic sampling of social media data and on mining people's opinions from it (https://www.advancedsymbolics.com).

I recently went through all 13 benchmarking datasets included in the repository and found that, although they cover a variety of categories (Computer Vision, Recommender Systems, Movie Ratings, etc.), the NLP domain is represented only by GloVe. GloVe embeddings are fairly old by now, and better, more sophisticated text embeddings are available in the industry today.

I would like to ask whether you would allow me to contribute a benchmarking dataset of between 1M and 10M tweets. The tweets would be probabilistically sampled and would represent a real-world text/NLP dataset. The test set would contain "n" queries of the kind users type into a search engine (e.g. "Top SUVs under $50K?"). We would then use the latest language models to convert all of the data to embeddings. Finally, we could run the experiments for all the indices on this dataset as well.
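The ground-truth step for such a dataset could look roughly like the sketch below. It is only an illustration under stated assumptions: random vectors stand in for the tweet and query embeddings (no language model is called), cosine distance is assumed as the metric, and the function names `normalize_rows` and `exact_neighbors` are hypothetical, not part of ann-benchmarks.

```python
import numpy as np


def normalize_rows(x):
    """Scale each row to unit L2 norm (guarding against zero vectors)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def exact_neighbors(train, test, k):
    """Brute-force k nearest neighbours under cosine distance.

    Returns (indices, distances), each of shape (n_test, k),
    with each row sorted from nearest to farthest.
    """
    train_n = normalize_rows(train)
    test_n = normalize_rows(test)
    # Cosine distance = 1 - cosine similarity, for every (query, tweet) pair.
    dists = 1.0 - test_n @ train_n.T
    idx = np.argsort(dists, axis=1, kind="stable")[:, :k]
    return idx, np.take_along_axis(dists, idx, axis=1)


# Stand-in data (assumption): 1000 "tweet" vectors and 10 "query" vectors.
rng = np.random.default_rng(0)
train = rng.standard_normal((1000, 64)).astype(np.float32)
test = rng.standard_normal((10, 64)).astype(np.float32)

neighbors, distances = exact_neighbors(train, test, k=5)
```

The full pairwise distance matrix is fine at this toy scale; at 1M to 10M tweets the same computation would need to be done in batches of queries to fit in memory.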

Why should we consider adding this dataset?
There has been a recent boom in LLMs and vector databases, and many people refer to ANN-Benchmarks when evaluating them. We also referred to ANN-Benchmarks to decide which index to choose, but unfortunately the repo includes no up-to-date NLP use case.

Please let me know if this plan sounds good to you!

Thanks and Regards,
Anubhav Chhabra

@anubhav562
Author

Hey @erikbern @maumueller @ale-f, any thoughts on this? If you would accept the contribution, I would start working on it right away!

@maumueller
Collaborator

Hi @anubhav562. Could you please provide some results on the suggested dataset? I like the application, but I think a new dataset should add some diversity in terms of the performance that we see.

@anubhav562
Author

Makes sense! I will work that out and post the results here. Thanks for your input @maumueller!
