DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection and Instruction-Aware Models for Conversational AI

News!

🎉 [AI Agent] March 18, 2024: Update xLAM for AI Agent. Check xLAM for the latest data and models relevant to AI Agent!
🎉 [Dataset Viewer] March 17, 2024: Update for dataset viewer issues on HuggingFace. Please refer to this repo to view each dataset; we provide 5 converted examples along with 5 original examples under each data folder. For example, ShareGPT contains two files: converted_examples.json and original_example.json.
[Upload models] Aug 18, 2023. We uploaded version 1.0 models (dialogstudio-t5-base-v1.0, dialogstudio-t5-large-v1.0, dialogstudio-t5-3b-v1.0) trained on a few selected DialogStudio datasets and more than 1000 general tasks.
[Version 1.0.1] Aug 1, 2023. We resolved minor issues in a few dialogues, added prompts for selected knowledge-grounded datasets, removed requirements for HuggingFace login, and made updates to SODA and ShareGPT datasets.
[Initial Release] July 2023. We're thrilled to announce the initial release of the largest unified dialog dataset collection. The full list of all available datasets is here.

Introduction

DialogStudio is a large collection of unified dialog datasets. The figure below summarizes the general statistics of DialogStudio. DialogStudio unifies each dataset while preserving its original information, which supports research on both individual datasets and Large Language Model (LLM) training. The full list of all available datasets is here.

The data are downloadable through HuggingFace, as described in Loading Data. We also provide examples for each dataset in this repo. For category-specific details, please refer to the individual folders for each category within the DialogStudio collection, e.g. the MULTIWOZ2_2 dataset under the task-oriented-dialogues category.

DialogStudio evaluates dialogue quality based on six critical criteria, namely Understanding, Relevance, Correctness, Coherence, Completeness, and Overall Quality. Each criterion is scored on a scale of 1 to 5, with the highest scores reserved for exceptional dialogues.

Given the vast number of datasets incorporated into DialogStudio, we utilized 'gpt-3.5-turbo' to assess 33 distinct datasets. The corresponding script used for this evaluation can be accessed through the link.
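The evaluation script itself is the one linked above. As a rough illustration only, the sketch below shows one way per-dialogue scores on the six criteria could be validated against the 1-to-5 scale and averaged per criterion. The function names (validate_scores, average_by_criterion) and the example scores are hypothetical and are not taken from the DialogStudio codebase or its results.

```python
from statistics import mean

# The six criteria named in this README.
CRITERIA = (
    "Understanding", "Relevance", "Correctness",
    "Coherence", "Completeness", "Overall Quality",
)


def validate_scores(scores: dict) -> dict:
    """Check that a dialogue has one score in [1, 5] for each criterion."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    for criterion in CRITERIA:
        if not 1 <= scores[criterion] <= 5:
            raise ValueError(
                f"{criterion} score out of range: {scores[criterion]}"
            )
    return scores


def average_by_criterion(all_scores: list) -> dict:
    """Average each criterion over a list of per-dialogue score dicts."""
    validated = [validate_scores(s) for s in all_scores]
    return {c: mean(s[c] for s in validated) for c in CRITERIA}


# Hypothetical scores for two dialogues (not real DialogStudio results).
example = [
    {"Understanding": 5, "Relevance": 4, "Correctness": 4,
     "Coherence": 5, "Completeness": 3, "Overall Quality": 4},
    {"Understanding": 3, "Relevance": 4, "Correctness": 2,
     "Coherence": 3, "Completeness": 3, "Overall Quality": 3},
]
print(average_by_criterion(example))
```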

The results of our dialogue quality assessment are presented below. We intend to release evaluation scores for individually selected dialogues in the upcoming period.

Loading Data

You can load any dataset in DialogStudio from the HuggingFace hub by specifying the {dataset_name}, which is exactly the dataset folder name. All available datasets are described in dataset content.

Below is one example to load the MULTIWOZ2_2 dataset under the task-oriented-dialogues category:

Load the dataset

from datasets import load_dataset

dataset = load_dataset('Salesforce/dialogstudio', 'MULTIWOZ2_2')

Here is the output structure of MultiWOZ 2.2:

DatasetDict({
    train: Dataset({
        features: ['original dialog id', 'new dialog id', 'dialog index', 'original dialog info', 'log', 'prompt', 'external knowledge non-flat', 'external knowledge', 'dst knowledge', 'intent knowledge'],
        num_rows: 8437
    })
    validation: Dataset({
        features: ['original dialog id', 'new dialog id', 'dialog index', 'original dialog info', 'log', 'prompt', 'external knowledge non-flat', 'external knowledge', 'dst knowledge', 'intent knowledge'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['original dialog id', 'new dialog id', 'dialog index', 'original dialog info', 'log', 'prompt', 'external knowledge non-flat', 'external knowledge', 'dst knowledge', 'intent knowledge'],
        num_rows: 1000
    })
})
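
To check the splits and features of a loaded dataset programmatically, a small helper along the following lines may be useful. This is a minimal sketch: summarize_splits is a hypothetical helper, not part of DialogStudio or the datasets library. It relies only on len(split) and split[0], so it should work on the object returned by load_dataset as well as on the plain-Python stand-in used here to keep the example runnable offline.

```python
from typing import Any, Mapping, Sequence


def summarize_splits(
    splits: Mapping[str, Sequence[Mapping[str, Any]]],
) -> dict:
    """Return {split_name: (num_rows, feature_names)} for each split."""
    summary = {}
    for name, rows in splits.items():
        # Read feature names from the first row; an empty split has none.
        features = list(rows[0].keys()) if len(rows) > 0 else []
        summary[name] = (len(rows), features)
    return summary


# Stand-in data so the sketch runs without network access. With the real
# data, pass the result of
# load_dataset('Salesforce/dialogstudio', 'MULTIWOZ2_2') instead.
toy = {
    "train": [
        {"original dialog id": "a", "log": []},
        {"original dialog id": "b", "log": []},
    ],
    "validation": [{"original dialog id": "c", "log": []}],
}
for split, (num_rows, features) in summarize_splits(toy).items():
    print(split, num_rows, features)
```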

Datasets

The datasets are split into several categories in this GitHub repository and on the HuggingFace hub. You can check the table of datasets for more information, and you can open each category folder to see a few examples.

Model

We've rolled out version 1.0 of models (dialogstudio-t5-base-v1.0, dialogstudio-t5-large-v1.0, dialogstudio-t5-3b-v1.0) trained on a few selected DialogStudio datasets. Check each Model Card for more details.

Below is one example of running the model on CPU:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/dialogstudio-t5-base-v1.0")
model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/dialogstudio-t5-base-v1.0")

input_text = "Answer the following yes/no question by reasoning step-by-step. Can you write 200 words in a single tweet?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

License

Our project is licensed as follows:

For all the modified datasets in DialogStudio:
- A portion of these datasets is under the Apache License 2.0.
- Some retain their original licenses even after modification.
- For a few datasets that lacked a license, we have cited the relevant papers.
Original dataset licenses: For reference, we also put the originally available licenses for each dataset into their respective dataset folders.
Code: Our codebase is under the Apache License 2.0.

For detailed licensing information, please refer to the specific licenses accompanying the original datasets. It is important to familiarize yourself with these terms as we do not assume responsibility for licensing issues.

Acknowledgement

We sincerely thank all dataset authors who have contributed to the Conversational AI field. Despite careful efforts, inaccuracies in our citations or references may occur. If you spot any errors or omissions, please raise an issue or submit a pull request to help us improve. Thank you!

Citation

The data and code in this repository are mostly developed for or derived from the paper below. If you use datasets from DialogStudio, we kindly request that you cite both the original work and our own work (accepted by EACL 2024 Findings as a long paper).

@article{zhang2023dialogstudio,
  title={DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI},
  author={Zhang, Jianguo and Qian, Kun and Liu, Zhiwei and Heinecke, Shelby and Meng, Rui and Liu, Ye and Yu, Zhou and Savarese, Silvio and Xiong, Caiming},
  journal={arXiv preprint arXiv:2307.10172},
  year={2023}
}

Contribution

We enthusiastically invite contributions from the community! Join us in our shared mission to propel the field of conversational AI forward!
