jianzhnie/awesome-instruction-datasets

Awesome Instruction Datasets

Introduction

Welcome to 'awesome-prompt-datasets', a comprehensive collection of high-quality open-source instruction tuning datasets for training chat-based LLMs (ChatGPT, LLaMA, Alpaca).

Instruction tuning and Reinforcement Learning from Human Feedback (RLHF) datasets are a key component of instruction-following LLMs such as ChatGPT. This repo is dedicated to providing a comprehensive list of datasets used for instruction tuning in various LLMs, making it easier for researchers and developers to access and use these resources.

With 'awesome-prompt-datasets', you can accelerate your research and development in NLP and unlock new opportunities for innovation. Let's explore the possibilities together!

Prompt Datasets

Following @yaodongC, we label each collected dataset according to the following rules:

(Lang)Lingual-Tags:

  • EN: Instruction datasets in English
  • CN: Instruction datasets in Chinese
  • ML: [Multi-lingual] Instruction datasets in multiple languages

(Task)Task-Tags:

  • MT: [Multi-task] Datasets containing multiple tasks
  • TS: [Task-specific] Datasets tailored for specific tasks

(Gen)Generation-method:

  • HG: [Human Generated Dataset] Datasets created by humans
  • SI: [Self-Instruct] Datasets generated using self-instruct methods
  • MIX: [Mixed Dataset] Dataset contains both human and machine generated data
  • COL: [Collection of Dataset] Dataset made from a collection of other datasets
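
The tags above combine into compact labels such as `EN|MT|SI`, the form used in the "Related" entries of the dataset list. A minimal sketch of validating such a label, assuming the pipe-separated order Lang, Task, Gen and allowing several tags per field separated by `/` (as in `EN/CN`). The function name `parse_tags` is hypothetical and not part of this repo.

```python
# Tag vocabularies copied from the legend above.
LANG_TAGS = {"EN", "CN", "ML"}
TASK_TAGS = {"MT", "TS"}
GEN_TAGS = {"HG", "SI", "MIX", "COL"}


def parse_tags(label: str) -> dict:
    """Split a label such as 'EN|MT|SI' into validated tag lists.

    Assumption: fields appear in the order Lang|Task|Gen, and a field
    may hold several tags separated by '/', e.g. 'EN/CN'.
    """
    fields = [field.strip() for field in label.split("|")]
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {label!r}")

    result = {}
    vocabularies = (("lang", LANG_TAGS), ("task", TASK_TAGS), ("gen", GEN_TAGS))
    for (name, allowed), field in zip(vocabularies, fields):
        tags = field.split("/")
        unknown = [tag for tag in tags if tag not in allowed]
        if unknown:
            raise ValueError(f"unknown {name} tag(s): {unknown}")
        result[name] = tags
    return result


print(parse_tags("EN/CN|MT|SI"))
```

A label with an unrecognized tag raises `ValueError`, which makes typos in a new table row easy to catch before it is appended.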

Statistics

Project Datasets Org Nums Lang Task Gen Type Src Url
Chain of Thought cot_data |few_shot_data Google 74771 EN/CN MT HG instruct with cot reasoning annotating CoT on existing data download
GPT4all nomic-ai/gpt4all-j-prompt-generations nomic-ai 806199 EN MT COL code, stories and dialogs distillation from GPT-3.5-turbo download
GPTeacher GPT-4 General-Instruct |Roleplay-Instruct |Code-Instruct | Toolformer teknium1 29013 EN MT SI general, roleplay, toolformer GPT-4 & toolformer download
Guanaco JosephusCheung/GuanacoDataset JosephusCheung 534610 ML MT SI various linguistic tasks text-davinci-003 download
HC3 Hello-SimpleAI/HC3 Hello-SimpleAI | 万得资讯 37175 EN/CN TS MIX dialogue evaluation human or ChatGPT download
HC3-Chinese Hello-SimpleAI/HC3-Chinese Hello-SimpleAI|万得资讯 13k CN TS MIX dialogue evaluation human or ChatGPT
alpaca tatsu-lab/alpaca tatsu-lab 52002 EN MT SI general instruct text-davinci-003 download
AlpacaDataCleaned yahma/alpaca-cleaned yahma 52k EN MT SI general instruct text-davinci-003 download
Chinese-LLaMA-Alpaca alpaca_data_zh_51k ymcui(讯飞) 51k CN MT SI general instruct text-davinci-003
Luotuo-Chinese-LLM 骆驼 trans_chinese_alpaca_data LC1332(商汤) 52k CN MT SI general instruct text-davinci-003
Natural Instructions Allen AI 61 task|1.5k task Allen AI 5040134 ML MT COL diverse nlp tasks human annotated datasets collection download
belle_cn BelleGroup/train_1M_CN |BelleGroup/train_0.5M_CN BelleGroup(链家) 1079517 CN TS/MT SI general, mathematical reasoning, dialogue text-davinci-003 download
instinwild instinwild_ch | instinwild_en 52191 EN/CN MT SI generation, open-qa, mind-storm text-davinci-003 download
HuaTuo (华驼) Chinese medical knowledge |liver cancer SCIR-HI (Harbin Institute of Technology) 8K CN TS SI public and self-built Chinese medical knowledge bases GPT3.5
prosocial dialog allenai/prosocial-dialog allenai 165681 EN TS MIX dialogue GPT-3 rewrites questions + humans feedback manually download
finance_en gbharti/finance-alpaca 68912 EN TS COL financial related qa GPT3.5 download
xP3 bigscience/xP3 bigscience 78883588 ML MT COL a collection of prompts & datasets across 46 languages & 16 NLP tasks human annotated datasets collection download
firefly YeungNLP/firefly-train-1.1M 1649398 CN MT COL 23 nlp tasks human annotated datasets collection download
instruct swype/instruct 888969 EN MT COL augmentation of GPT4All, Alpaca, and open-source Meta datasets augmentation performed using the advanced NLP tools provided by AllenAI download
Code Alpaca sahil280114/codealpaca 20022 EN TS SI code generation, editing, optimization text-davinci-003 download
Alpaca_GPT4 alpaca_gpt4_data|alpaca_gpt4_data_zh |comparison_data_v2 Microsoft 52002 EN/CN MT SI general instruct generated by GPT-4 using Alpaca download
webGPT openai/webgpt_comparisons openai 18994 EN TS MIX information retrieval (IR) QA fine-tuned GPT-3, each instruction has two outputs, select better one download
dolly 2.0 databricks/databricks-dolly-15k databricks 15015 EN TS HG closed QA, summarization, etc., with Wikipedia as references human annotated download
mosaicml/llm-foundry mosaicml/dolly_hhrlhf mosaicml 59.3K EN TS HG This dataset is a combination of Databricks' dolly-15k dataset and a filtered subset of Anthropic's HH-RLHF. human annotated
baize 白泽 alpaca_chat_data.json |medical_chat_data.json | quora_chat_data.json |stackoverflow_chat_data.json project-baize 653699 EN MT COL a collection from Alpaca, Quora, StackOverFlow and MedQuAD questions human annotated datasets collection download
hh-rlhf Anthropic/hh-rlhf Anthropic 284517 EN TS MIX dialogue dialog between human and RLHF models download
OIG(part) laion/OIG laion 49237 EN MT COL created from various tasks, such as question and answering using data augmentation, human annotated datasets collection download
GAOKAO Fill-in-the-blank_Questions | Multiple-choice_Questions | Open-ended_Questions OpenLMLab 2785 CN MT COL Multiple-choice, Fill-in-the-blank and Open-ended questions from examination human annotated download
camel | 骆驼 camel-ai/code|camel-ai/biology |camel-ai/physics |camel-ai/chemistry |camel-ai/math camel-ai 760620 EN MT SI Role-Playing conversations in AI Society, Code, Math, Physics, Chemistry, Biology gpt-3.5-turbo download
FLAN-Muffin Muennighoff/flan 1764800 EN MT COL 60 nlp tasks human annotated datasets collection download
COIG COIG BAAI|智源 298428 CN MT COL collected from Exam, Translated, Human Value Alignment Instructions and Counterfactual Correction Multi-round Chat using automatic tool and manual verification download
GPT4Tools gpt4tools_71k.json StevenGrove 71446 EN MT SI a collection of tool-related instructions gpt-3.5-turbo download
ShareChat RyokoAI/ShareGPT52K RyokoAI 1663241 EN MT MIX general instruct crowdsourcing to collect conversations between people and ChatGPT (ShareGPT) download
Auto CoT kojima-takeshi188/zero_shot_cot/dataset |kojima-takeshi188/zero_shot_cot/log amazon-science EN download
MOSS(复旦 Moss) fnlp/moss-002-sft-data| moss-003-sft-data fnlp 1583595 EN/CN SI download
ultrachat stingning/ultrachat thnlp 28247446 EN download
StackLLaMA lvwerra/stack-exchange-paired todo EN HG
Self-Instruct yizhongw/self-instruct 82 K EN SI SI
Zhihu-KOL Zhihu-KOL OpenAssistant 1M SI HG Zhihu data for training Open Assistant
stanfordnlp/SHP stanfordnlp/SHP stanfordnlp 385 k EN MT HG human preferences over responses
LAION-AI/Open-Assistant OpenAssistant/oasst1 OpenAssistant 84.4k EN MT HG OpenAssistant Conversations Dataset (OASST1) human-generated, human-annotated
akoksal/LongForm akoksal/LongForm akoksal/LongForm 30k EN SI HG A diverse set of human-written documents is selected from existing corpora (such as C4 and Wikipedia), and instructions for the given documents are generated by an LLM.
sail-sg/symbolic-instruction-tuning sail/symbolic-instruction-tuning sail-sg 800K ML SI Human Synthetic Examples
Medical QA michael-wzhu/PromptCBLUE michaelwzhu/ChatMed_Consult_Dataset michael-wzhu 110113 CN SI Medical consultation questions collected from the internet (110,113), reflecting the real-world consultation needs of different users/patients. All responses are currently generated by the OpenAI GPT-3.5 engine.
mbzuai-nlp/LaMini-LM MBZUAI/LaMini-instruction MBZUAI/LaMini-instruction 2.58M EN MT SI knowledge distilled from large language models via offline distillation
pCLUE pCLUE 1.2M
WizardLM victor123/evol_instruct_70k WizardLM 70k EN MT

RLHF Datasets

Statistics

Project Links Org Nums Lang Summary
webgpt_comparisons Openai 19,578 English In the WebGPT paper, the authors trained a reward model from human feedback. They used the reward model to train a long form question answering model to align with human preferences. This is the dataset of all comparisons that were marked as suitable for reward modeling by the end of the WebGPT project. There are 19,578 comparisons in total.
SHP stanfordnlp 349 K English SHP is a dataset of 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice. The preferences are meant to reflect the helpfulness of one response over another, and are intended to be used for training RLHF reward models and NLG evaluation models (e.g., SteamSHP).
rlhf-reward-datasets yitingxie 76.3 k English
Dahoas/full-hh-rlhf Dahoas 112 k English Anthropic's HH dataset reformatted into prompt, chosen, rejected samples.
Dahoas/synthetic-instruct-gptj-pairwise Dahoas English
Dahoas/rm-static Dahoas 76.3k English Split of hh-static used for training reward models after supervised fine-tuning.
Anthropic/hh-rlhf Anthropic 22k English This RLHF dataset is an iterated 'online' dataset that includes data from 52B language models. It contains 22k helpfulness comparisons and no red-teaming data.
Instruction-Tuning-with-GPT-4/GPT-4-LLM Instruction-Tuning-with-GPT-4 52k English Ranked responses (Note: Data is evaluated by GPT-4 model NOT human) of Alpaca prompts from three models (GPT-4, GPT-3.5 and OPT-IML) by asking GPT-4 to rate the quality. Author believes "GPT-4 is capable of identifying and fixing its own mistakes, and accurately judging the quality of responses"
thu-coai/Safety-Prompts thu-coai/Safety-Prompts thu-coai 100k Chinese Chinese safety prompts for evaluating and improving the safety of large models and for aligning model outputs with human values.
Chatgpt-Comparison-Detection project Hello-SimpleAI/HC3 24.3K English Human ChatGPT Comparison Corpus, 60k human answers and 27K ChatGPT answers for around 24K questions.

Open ChatLLMs

Release Model_name Base Model_Size Datasets Number of Instances Language
2022-12 GPT-3 Self Inst. GPT-3 175B Self-Instruct 82 k En
2023-03-03 alpaca LLaMA 7B alpaca_data 52 k En
2023-03-19 alpaca-lora LLaMA 7B 13B 30B alpaca_data, alpaca_data_cleaned 52 k En
2023-03-23 Chinese-Vicuna LLaMA 7B 13B BELLE, GuanacoDataset 1M Zh
2023-03-24 Alpaca-CoT LLaMA 7B dataset ---- En Zh
2023-03-25 dolly dolly 6B alpaca_data 52 k En
2023-03-25 guanaco LLaMA 7B GuanacoDataset 534 k En Zh Ja De
2023-03-28 Chinese-LLaMA-Alpaca LLaMA 7B alpaca_data_zh, pCLUE, translation2019zh, alpaca_data, Self-Instruct 2M Zh
2023-03-29 ColossalChat LLaMA 7B 13B InstructionWild 104 k En Zh
2023-03-31 Luotuo LLaMA ChatGLM 7B 6B trans_chinese_alpaca_data 52k Zh
2023-03-31 cerebras-lora-alpaca Cerebras-GPT 2.7B AlpacaDataCleaned 52k En

The template

Append the new project to the end of the file, using the following template:

[{Project-name}/{Dataset-name}]{https://github.com/link/to/project}

- [paper/project link](link)
- [dataset link](link)
- Related work: (if applicable)

Some introductions ...

The Prompt Datasets List

Alpaca, released by Stanford, is an instruction-tuned model fine-tuned from Meta AI's LLaMA model.

Alpaca automatically generated 52k instruction examples using GPT-3.5 and used them to fine-tune the LLaMA model. Experimental results show that it can reach or even exceed the performance of GPT-3.5 on some tasks.

Instruction tuning is a key component of ChatGPT. OpenAI used its own user-based instruction dataset, but unfortunately this dataset is not open-sourced. Self-Instruct released a small instruction dataset including 175 human-written instructions. The Stanford Alpaca team generated 52K instructions with the text-davinci-003 model based on those 175 seed instructions.

This project targets a larger and more diverse instruction dataset. To this end, we collected 429 instructions from ChatGPT usage screenshots and released both English and Chinese versions. We found these instructions to be very diverse even though the scale is still small. We follow Alpaca to generate 52K instructions and their responses. All data can be found in the data dir.

Note: This is an ongoing project. We are still collecting and improving our data. We release this dataset as early as possible to speed up our LLM research. We will also release a whitepaper soon.

  • Data generation model: text-davinci-003
  • Cost: $6000

52K instruction examples generated from a modified self-instruct pipeline with 429 human-written seed tasks.

SHP is a dataset of 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice. The preferences are meant to reflect the helpfulness of one response over another, and are intended to be used for training RLHF reward models and NLG evaluation models (e.g., SteamSHP).

Each example is a Reddit post with a question/instruction and a pair of top-level comments for that post, where one comment is more preferred by Reddit users (collectively). SHP exploits the fact that if comment A was written after comment B but has a higher score nonetheless, then A is ostensibly more preferred to B. If A had been written before B, then we could not conclude this, since its higher score could have been the result of more visibility. We chose data where the preference label is intended to reflect which response is more helpful rather than which is less harmful, the latter being the focus of much past work.
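
The timing rule described above can be sketched in a few lines. This is an illustration of the inference rule only, not the SHP construction code. The `Comment` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Comment:
    text: str
    score: int         # net upvotes (hypothetical field name)
    created_at: float  # posting time, e.g. a Unix timestamp (hypothetical)


def infer_preferred(a: Comment, b: Comment) -> Optional[Comment]:
    """Return the comment inferred to be preferred, or None.

    A preference is inferred only when the LATER comment has the strictly
    higher score. If the earlier comment scores higher, its lead may just
    reflect longer visibility, so no conclusion is drawn.
    """
    if a.created_at == b.created_at:
        return None
    later, earlier = (a, b) if a.created_at > b.created_at else (b, a)
    return later if later.score > earlier.score else None


early = Comment("Use a cast-iron pan.", score=10, created_at=100.0)
late = Comment("Preheat the pan first, then add oil.", score=50, created_at=200.0)
print(infer_preferred(early, late).text)
```

Swapping the two scores makes the function return `None`, which is the case the paragraph above says cannot be labeled.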

How is SHP different from Anthropic's HH-RLHF dataset? Most notably, all the data in SHP is naturally occurring and human-written, whereas the responses in HH-RLHF are machine-written, giving us two very different distributions that can complement each other.

  • Summary: A dataset for Chain-of-Thought reasoning based on LLaMA and Alpaca. Note: Their repository will continuously collect various instruction tuning datasets.
  • paper: N/A
  • Cost: N/A
  • Summary: A collection of modular datasets generated by GPT-4, General-Instruct - Roleplay-Instruct - Code-Instruct - and Toolformer
  • Data generation model: GPT-4
  • paper: N/A
  • Cost: N/A
  • Summary: UltraChat aims to construct open-source, large-scale, multi-round dialogue data. The first part of UltraChat (i.e., the Questions about the World sector) has been released and contains 280k diverse and informative dialogues. More dialogues about writing and creation, and about assistance on existing materials, are to come.
  • Data generation model: GPT-3.5-turbo
  • paper: N/A
  • Cost: N/A
  • Summary: Based on the Stanford Alpaca data, ChatAlpaca extends the data to multi-turn instructions and their corresponding responses. More data (20k) and the Chinese translated version are to come.
  • Data generation model: GPT-3.5-turbo
  • paper: N/A
  • Cost: N/A
  • Related: (tatsu-lab/Alpaca)|52K|EN|MT|SI
  • Summary: Chinese datasets of 23 tasks combined with human-written instruction templates.
  • Data generation model: N/A
  • paper: N/A
  • Cost: N/A
  • Summary: This dataset was generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.
  • Data generation model: N/A
  • paper: Free Dolly
  • Cost: N/A
  • Summary: OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages distributed across 66,497 conversation trees, in 35 different languages, annotated with 461,292 quality ratings.
  • Data generation model: N/A
  • paper: OpenAssistant Conversations - Democratizing Large Language Model Alignment
  • Cost: N/A

BELLE/data/1.5M

alpaca_chinese_dataset

Med-ChatGLM/data

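  • Download: https://github.com/SCIR-HI/Med-ChatGLM
  • Size: 7k
  • Generation method: Question-answer data built around a medical knowledge base using the GPT-3.5 API, with multiple prompt formats designed to make full use of the knowledge
  • Tasks covered: Medical-domain question answering, including complications, high-risk factors, histological examination, clinical symptoms, drug treatment, and adjuvant therapy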

pCLUE

COIG

https://github.com/FreedomIntelligence/InstructionZoo

https://github.com/lightaime/camel

The RLHF Datasets List

  • Summary: Each example is a Reddit post with a question/instruction and a pair of top-level comments for that post, where one comment is more preferred by Reddit users (collectively).
  • Data generation model: N/A
  • paper: N/A
  • Cost: N/A
  • Summary: Ranked responses (Note: Data is evaluated by GPT-4 model NOT human) of Alpaca prompts from three models (GPT-4, GPT-3.5 and OPT-IML) by asking GPT-4 to rate the quality. Author believes "GPT-4 is capable of identifying and fixing its own mistakes, and accurately judging the quality of responses"
  • Data generation model: GPT-4
  • paper: Instruction Tuning with GPT-4
  • Cost: N/A
  • Related: -(tatsu-lab/Alpaca)|52K|EN|MT|SI

Natural Instruction / Super-Natural Instruction

Allen AI was the first organization to try using instructions as prompts for fine-tuning LLMs. The Natural Instruction paper gives a basic understanding of how the instructions are labeled.

Its proposed dataset includes 61 different NLP tasks.

Super-Natural Instruction is a greatly expanded version of Natural Instruction, containing more than 1,600 different NLP tasks across more than 76 task types (such as classification, extraction, and sequence labeling).

BigScience is jointly organized by Hugging Face and the French CNRS, IDRIS, GENCI, etc. It is one of the largest open-source LLM organizations.

BigScience developed the PromptSource project at the end of 2021 and open-sourced a series of toolkits to help researchers build prompts based on existing NLP tasks. So far, the PromptSource project contains more than 2000 prompt templates for 270 NLP tasks.

On this basis, BigScience constructed the P3 dataset. You can find P3 data on the Hugging Face Hub; the data size of P3 is between 100M and 1B.

xMTF - BigScience

Based on the English prompt, BigScience extends its prompt to multiple non-English languages.

The project contains 13 NLP tasks and is available in 46 different languages. The number of languages covered by each prompt varies.

After fine-tuning on this multilingual data, both BLOOM and T0 achieved the desired multilingual ability.

HH-RLHF - Anthropic

Claude, developed by Anthropic, is one of the main competitors of ChatGPT.

Anthropic has open-sourced the RLHF dataset it uses in its own product line.

The original intention of the HH-RLHF project is to train Helpful and Harmless (HH) LLMs. Therefore, its human feedback reflects not only the quality of the responses but also whether they contain harmful information.

The paper documents how to use RLHF data to align model behavior with human values, and describes how the dataset was constructed and the standards applied.

Using LLMs to independently generate instruction data is an active direction in the field of instruction-tuning.

Unnatural Instruction uses GPT-3 (text-davinci-002) to generate 64k instruction prompts. The same model is then used to rewrite these 64k prompts, yielding 240k instruction examples in total.

The paper shows that instruction tuning on LLM-generated prompts works well, even surpassing models such as T0 that are fine-tuned on P3 and other data.

Self-Instruct also uses LLMs to generate prompts for instruction tuning, but with a more fine-grained generation process.

Concepts such as a task pool and quality filtering were introduced to partially alleviate the noise problem of self-instruct-style data.
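
One way to picture the task pool and quality filtering: a candidate instruction joins the pool only if it is not too similar to anything already in it. The sketch below uses a word-level longest-common-subsequence F-score (in the spirit of ROUGE-L) with a threshold of 0.7. Both the similarity function as written here and the threshold value are assumptions for illustration, not a reproduction of the Self-Instruct code.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def similarity(s1: str, s2: str) -> float:
    """LCS-based F-score between two instructions (whitespace tokens)."""
    a, b = s1.lower().split(), s2.lower().split()
    if not a or not b:
        return 0.0
    lcs = lcs_length(a, b)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(a), lcs / len(b)
    return 2 * precision * recall / (precision + recall)


def filter_into_pool(pool, candidates, threshold=0.7):
    """Add each candidate that is dissimilar to every instruction in the pool."""
    pool = list(pool)  # do not mutate the caller's list
    for candidate in candidates:
        if all(similarity(candidate, existing) < threshold for existing in pool):
            pool.append(candidate)
    return pool


seed_pool = ["Translate the sentence into French."]
generated = ["Translate the sentence into German.", "Write a poem about autumn."]
print(filter_into_pool(seed_pool, generated))
```

Near-duplicates of existing instructions are dropped, while novel instructions extend the pool and can serve as context for later generation rounds.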

UnifiedSKG has added knowledge grounding in the Text-to-Text framework, that is, in the prompt-output framework, it has added structured data for assistance.

As an example, some NLP tasks rely heavily on structured knowledge bases/databases. The idea of UnifiedSKG is to serialize the required database and embed it into the prompt.
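
A minimal sketch of what serializing a table into a prompt can look like. The linearization format (`col : ... row 1 : ...`) and the function names are assumptions chosen for illustration. The exact format used by UnifiedSKG may differ.

```python
def serialize_table(columns, rows):
    """Flatten a table into one line of text: header first, then numbered rows."""
    header = "col : " + " | ".join(columns)
    body = [
        f"row {i} : " + " | ".join(str(cell) for cell in row)
        for i, row in enumerate(rows, start=1)
    ]
    return " ".join([header] + body)


def build_prompt(question, columns, rows):
    """Embed the serialized table into the text-to-text input."""
    return f"{question} ; structured knowledge: {serialize_table(columns, rows)}"


columns = ["city", "country", "population"]
rows = [["Paris", "France", 2100000], ["Rome", "Italy", 2800000]]
print(build_prompt("Which city has the larger population?", columns, rows))
```

Because the table becomes ordinary text, the same prompt-output model can handle table QA, text-to-SQL, and other structured-knowledge tasks without architectural changes.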

UnifiedSKG represents a direction in the field of LLMs that attempts to use structured knowledge to enhance performance.

In this project, Google merged its own Flan 2021 data with some open source instruction data (P3, super-natural instruction, etc.).

In Flan Collection's paper, Google also summarizes some key points in Flan series model training/reasoning, which may have good reference value.

The Flan Collection compiles datasets from Flan 2021, P3, and Super-Natural Instructions, along with dozens more datasets, into one place, and formats them into a mix of zero-shot, few-shot, and chain-of-thought templates.
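
To make the three template styles concrete, here is a small sketch that renders the same question in each. The template wording is hypothetical and much simpler than the actual Flan templates. The chain-of-thought trigger phrase is borrowed from zero-shot CoT prompting and is used here only as a stand-in.

```python
def zero_shot(question: str) -> str:
    """Question only, with no worked examples."""
    return f"Q: {question}\nA:"


def few_shot(question: str, exemplars) -> str:
    """Prepend (question, answer) exemplars before the target question."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    return f"{shots}\n\nQ: {question}\nA:"


def chain_of_thought(question: str) -> str:
    """Ask the model to reason before answering (assumed trigger phrase)."""
    return f"Q: {question}\nA: Let's think step by step."


question = "What is 2 + 3?"
exemplars = [("What is 1 + 1?", "2")]
for prompt in (zero_shot(question), few_shot(question, exemplars), chain_of_thought(question)):
    print(prompt)
    print("---")
```

Mixing these formats during training exposes the model to each prompting style it may encounter at inference time.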

InstructDial

InstructDial is an attempt at instruction fine-tuning on a specific task type. Experimental results show that, after fine-tuning on dialogue instruction data, the model performs better on dialogue tasks than models fine-tuned on very large-scale task sets.

ChatGPT Distillation Data

Public User-Shared Dialogues with ChatGPT (ShareGPT).

Around 60K dialogues shared by users on ShareGPT were collected using public APIs. To maintain data quality, we deduplicated on the user-query level and removed any non-English conversations. This leaves approximately 30K examples.
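
The two cleaning steps mentioned above (deduplication on the user query and removal of non-English conversations) can be sketched as follows. The dialogue representation, a list of `(role, text)` tuples, is hypothetical, and the ASCII-ratio check is a crude stand-in for a real language detector, used here only to keep the example dependency-free.

```python
def is_probably_english(text: str, min_ascii_ratio: float = 0.9) -> bool:
    """Crude heuristic (assumption): mostly-ASCII text is treated as English."""
    if not text:
        return False
    ascii_chars = sum(ch.isascii() for ch in text)
    return ascii_chars / len(text) >= min_ascii_ratio


def clean_dialogues(dialogues):
    """Keep English dialogues whose first user query has not been seen before."""
    seen_queries = set()
    kept = []
    for dialogue in dialogues:
        first_query = next((text for role, text in dialogue if role == "user"), "")
        key = " ".join(first_query.lower().split())  # normalize case and spacing
        if not key or key in seen_queries:
            continue
        if not all(is_probably_english(text) for _, text in dialogue):
            continue
        seen_queries.add(key)
        kept.append(dialogue)
    return kept


dialogues = [
    [("user", "How do I reverse a list in Python?"), ("assistant", "Use reversed().")],
    [("user", "how do I reverse  a list in python?"), ("assistant", "Try slicing.")],
    [("user", "你好,请介绍一下你自己"), ("assistant", "我是一个语言模型。")],
]
print(len(clean_dialogues(dialogues)))
```

In practice a proper language-identification model would replace the heuristic, since English text with code, emoji, or quoted foreign phrases can fall below any fixed ASCII ratio.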

Human ChatGPT Comparison Corpus (HC3).

We use both the human and ChatGPT responses from the HC3 English dataset, which contains around 60K human answers and 27K ChatGPT answers for around 24K questions, resulting in a total of around 87K question-answer examples.

Open Instruction Generalist (OIG).

We use a manually-selected subset of components from the Open Instruction Generalist dataset curated by LAION. Specifically, we use the grade-school-math-instructions, the poetry-to-songs, and the plot-screenplay-books-dialogue datasets. This results in a total of around 30k examples.

OpenAI WebGPT.

In the WebGPT paper, the authors trained a reward model from human feedback. They used the reward model to train a long form question answering model to align with human preferences. This is the dataset of all comparisons that were marked as suitable for reward modeling by the end of the WebGPT project. There are 19,578 comparisons in total.

Each example in the dataset contains a pair of model answers for a question, along with the associated metadata. Each answer has a preference score from humans that can be used to determine which of the two answers is better.
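
For reward-model training, such a comparison is typically turned into a (prompt, chosen, rejected) triple, the same layout mentioned for Dahoas/full-hh-rlhf in the RLHF table. A minimal sketch, assuming a flat record with the keys `question`, `answer_0`, `answer_1`, `score_0`, and `score_1`. These key names and the flat layout are assumptions, so check the dataset card before relying on them.

```python
from typing import Optional


def to_preference_pair(example: dict) -> Optional[dict]:
    """Convert one scored comparison into a prompt/chosen/rejected triple.

    Returns None for ties, since they carry no preference signal.
    """
    score_0, score_1 = example["score_0"], example["score_1"]
    if score_0 == score_1:
        return None
    if score_0 > score_1:
        chosen, rejected = example["answer_0"], example["answer_1"]
    else:
        chosen, rejected = example["answer_1"], example["answer_0"]
    return {"prompt": example["question"], "chosen": chosen, "rejected": rejected}


example = {
    "question": "Why is the sky blue?",
    "answer_0": "Because of the ocean.",
    "answer_1": "Shorter wavelengths of sunlight scatter more in the atmosphere.",
    "score_0": -1.0,
    "score_1": 1.0,
}
print(to_preference_pair(example))
```

Dropping ties is one design choice among several. Another option is to keep them with a soft label, depending on the reward-model loss being used.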

OpenAI Summarization.

The OpenAI summarization dataset contains ~93K examples. Each example consists of human feedback on summaries generated by a model: human evaluators chose the better of two summaries.

Datasets without license information

  • Summary: A compilation of tatsu-lab/alpaca, Dahoas/instruct-human-assistant-prompt, and allenai/prosocial-dialog
  • Data generation model: N/A
  • paper: N/A
  • Cost: N/A

Contributing

We aim to keep improving this repo. If you are interested in contributing, please refer to the contribution instructions.

License

Awesome-Prompt-Dataset is released under the Apache 2.0 license.
