Awesome NLP benchmarks for intent-based chatbots

List of benchmarks to evaluate the quality of your intent matching and entity recognition chatbot components. Given the myriad of NLP/NLU libraries you can use to build your own chatbot (DialogFlow, Amazon Lex, Rasa, NLP.js, Xatkit, ...), it is important to have common datasets to benchmark them against.

To evaluate the quality of intent matching and entity recognition components, we cannot just use raw NLP datasets. We need datasets that include, for each example (see the sketch after this list):

  • The user utterance
  • The intent that should be matched given that utterance
  • The list of entities that should be identified in that utterance
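
As an illustration, one annotated example could look like this. A minimal sketch in Python; the field names, entity types, and character offsets are hypothetical, since every dataset below defines its own schema:

```python
# Hypothetical shape of a single annotated example; real datasets use
# their own field names and entity encodings (character spans, BIO tags, ...).
example = {
    "utterance": "Book a table for two at Luigi's tomorrow at 8pm",
    "intent": "BookRestaurant",
    "entities": [
        # Character offsets into the utterance (start inclusive, end exclusive).
        {"type": "party_size",      "value": "two",             "start": 17, "end": 20},
        {"type": "restaurant_name", "value": "Luigi's",         "start": 24, "end": 31},
        {"type": "time",            "value": "tomorrow at 8pm", "start": 32, "end": 47},
    ],
}
```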

Ideally, the dataset also ships with predefined training, validation, and test splits, so that different authors/vendors can more precisely replicate and report the evaluation results they get for a given library.
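
When a dataset comes as a single file, you can still derive a reproducible split yourself. A minimal sketch using scikit-learn (our own choice of tool, not mandated by any dataset below), stratifying on the intent label so every split keeps roughly the same intent distribution:

```python
from sklearn.model_selection import train_test_split

def split_dataset(examples, seed=42):
    """Derive reproducible 80/10/10 train/validation/test splits from a
    list of annotated examples, stratified on the intent label.
    Assumes every intent has enough examples to appear in each split."""
    intents = [ex["intent"] for ex in examples]
    train, rest = train_test_split(
        examples, test_size=0.2, random_state=seed, stratify=intents)
    rest_intents = [ex["intent"] for ex in rest]
    val, test = train_test_split(
        rest, test_size=0.5, random_state=seed, stratify=rest_intents)
    return train, val, test
```

Fixing the random seed is what makes the split replicable: reporting the seed alongside the results lets other authors evaluate on exactly the same test set.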

Datasets

  • NLU Evaluation Corpora. Three corpora which can be used for evaluating chatbots or other conversational interfaces. Two of the corpora were extracted from StackExchange, one from a Telegram chatbot. For instance, these corpora have been used in this benchmark.
  • Home automation corpora. Natural language data for human-robot interaction in the home domain (25K entries). The SLURP dataset adds the corresponding acoustic data to this textual data, to test voice bots.
  • MASSIVE. A parallel dataset of > 1M utterances across 52 languages. Utterances span 60 intents and include 55 slot types. MASSIVE was created by localizing the SLURP dataset mentioned above.
  • Clinc. An evaluation dataset for intent classification, with a focus on testing out-of-scope prediction capabilities.
  • Kaggle dataset for intent classification and NER. It covers 7 intents; the data is in JSON format, with each entity tagged in the utterance.
  • HINT3. Three new datasets created from live chatbots in diverse domains. They contain only intent matching data (no entity annotations).
  • Banking77. A fine-grained set of intents in the banking domain. It comprises 13,083 customer service queries labeled with 77 intents.
  • XitXat. A Catalan conversational dataset made of 950 chatbot conversations in 10 different domains.

Non-intent-based datasets

  • OpenAssistant Conversations Dataset (OASST1). A human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees.

Papers

Research works discussing, proposing or comparing NLP benchmarks:

Additional Links

Contributing

Feel free to open an issue or submit a pull request with any NLP dataset for chatbots that we may be missing (thanks!).
