Hey @erikbern and ANN Benchmarking team!
Thanks for creating and maintaining a clean and sophisticated repository for the community.
I am Anubhav, an NLP researcher at Advanced Symbolics Inc. My company and I work on probabilistic sampling of social media data and mining people's opinions from it (https://www.advancedsymbolics.com).
I recently went through all 13 benchmarking datasets included in the repository and noticed that although they span a variety of categories (computer vision, recommender systems, movie ratings, etc.), datasets from the NLP domain are lacking (only GloVe). GloVe embeddings are fairly dated by now, and much more sophisticated textual embeddings are available in the industry.
I would like to ask whether you would allow me to contribute a benchmarking dataset of 1M–10M tweets. The tweets would be probabilistically sampled and would represent a real-world text/NLP dataset. The test set would contain "n" queries resembling search-engine queries from users (e.g. "Top SUVs under $50K?"). We would then use the latest language models to convert all of the data to embeddings. Finally, the experiments for all the indices could be run on this dataset as well.
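For concreteness, here is a minimal sketch of how the ground-truth neighbors for such a dataset could be computed in the style ann-benchmarks uses (brute-force k-NN under angular/cosine distance over the embedding matrices). The function names and k=10 default are illustrative assumptions, not the repository's actual tooling:

```python
import numpy as np

def normalize(x):
    # L2-normalize each row so cosine similarity is a plain dot product
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def ground_truth(train, test, k=10):
    """Brute-force k nearest neighbors under angular (cosine) distance.

    train: (n_train, dim) embedding matrix of the corpus tweets
    test:  (n_test, dim) embedding matrix of the queries
    Returns (indices, distances), each of shape (n_test, k).
    """
    train_n, test_n = normalize(train), normalize(test)
    sims = test_n @ train_n.T                       # cosine similarities
    idx = np.argsort(-sims, axis=1)[:, :k]          # top-k by similarity
    dist = 1.0 - np.take_along_axis(sims, idx, axis=1)
    return idx, dist
```

The resulting index/distance arrays would then be stored alongside the train and test embeddings, so every index implementation is evaluated against the same exact-neighbor baseline.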
Why should we consider adding this dataset?
There has been a recent boom in LLMs and vector databases, and many people refer to ANN-Benchmarks for guidance. We also used ANN-Benchmarks to decide which index to choose, but sadly there was no up-to-date NLP use case included in the repo.
Please let me know if this plan sounds good to you!
Thanks and Regards,
Anubhav Chhabra
Hi @anubhav562. Could you please provide some results on the suggested dataset? I like the application, but I think it should add some diversity in terms of the performance that we see.