GitHub - sileod/tasksource: Datasets collection and preprocessings framework for NLP extreme multitask learning

tasksource 600+ curated datasets and preprocessings for instant and interchangeable use

Hugging Face Datasets is an excellent library, but it lacks standardization, and datasets often require preprocessing before they can be used interchangeably. tasksource streamlines interchangeable dataset usage to scale evaluation or multi-task learning.

Each dataset is standardized to a MultipleChoice, Classification, or TokenClassification template with canonical fields. We focus on discriminative tasks (= with negative examples or classes) for our annotations but also provide a SequenceToSequence template. All implemented preprocessings are in tasks.py or tasks.md. A preprocessing is a function that accepts a dataset and returns the standardized dataset. Preprocessing code is concise and human-readable.
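To make the idea of a preprocessing concrete, here is a minimal sketch in plain Python. It is not the code from tasks.py: the helper `to_multiple_choice`, the toy record, and the canonical field names (`inputs`, `choice0`, `choice1`, `labels`) are assumptions made for illustration.

```python
# Hypothetical illustration only; the real templates live in tasks.py.
def to_multiple_choice(row, inputs, choices, labels):
    """Map one raw record onto assumed canonical multiple-choice fields."""
    out = {"inputs": row[inputs], "labels": row[labels]}
    # One canonical column per candidate answer: choice0, choice1, ...
    for i, column in enumerate(choices):
        out[f"choice{i}"] = row[column]
    return out


# Toy record with dataset-specific column names (made up for this sketch).
raw = {
    "sentence": "The trophy did not fit in the suitcase because _ was too big.",
    "option1": "the trophy",
    "option2": "the suitcase",
    "answer": 0,
}

standardized = to_multiple_choice(raw, "sentence", ["option1", "option2"], "answer")
print(sorted(standardized))
```

Once every dataset exposes the same columns, downstream training or evaluation code no longer needs to know which dataset a record came from.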

Installation and usage:

pip install tasksource

from tasksource import list_tasks, load_task
df = list_tasks(multilingual=False) # takes some time

for id in df[df.task_type=="MultipleChoice"].id:
    dataset = load_task(id) # all yielded datasets can be used interchangeably

Browse the 500+ curated tasks in tasks.md (200+ MultipleChoice tasks, 200+ Classification tasks), and feel free to request a new task. Datasets are downloaded to $HF_DATASETS_CACHE (like any Hugging Face dataset), so ensure you have more than 100GB of space available.
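Before downloading many tasks, it may help to check the free space on the cache volume. The following sketch uses only the standard library; the helper `free_gb` is hypothetical, and the fallback path assumes the usual Hugging Face default cache location when `HF_DATASETS_CACHE` is unset.

```python
import os
import shutil
from pathlib import Path


def free_gb(path):
    """Return free space, in GB, on the filesystem that holds `path`."""
    p = Path(path).expanduser()
    # The cache directory may not exist yet, so walk up to an existing parent.
    while not p.exists():
        p = p.parent
    return shutil.disk_usage(p).free / 1e9


# Assumed default location when the environment variable is not set.
cache_dir = os.environ.get("HF_DATASETS_CACHE", "~/.cache/huggingface/datasets")
available = free_gb(cache_dir)

if available < 100:
    print(f"Only {available:.0f} GB free for {cache_dir}; consider another volume.")
else:
    print(f"{available:.0f} GB free for {cache_dir}.")
```

If space is short, pointing `HF_DATASETS_CACHE` at a larger disk before importing the library is one option.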

You can now also use:

from datasets import load_dataset
load_dataset("tasksource/data", "glue/rte", max_rows=30_000)

Pretrained models:

Text encoder pretrained on tasksource reached state-of-the-art results: 🤗/deberta-v3-base-tasksource-nli

Tasksource pretraining is notably helpful for RLHF reward modeling or any kind of classification, including zero-shot. You can also find a large and a multilingual version.

tasksource-instruct

The repo also contains some recasting code to convert tasksource datasets to instructions, providing one of the richest instruction-tuning datasets: 🤗/tasksource-instruct-v0
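As a rough illustration of what recasting to instructions can look like, the sketch below turns a standardized multiple-choice record into a prompt and target pair. The function `to_instruction`, the prompt wording, and the field names are assumptions; the actual recasting code in the repo may differ.

```python
# Hypothetical recasting sketch; not the repo's actual instruction format.
def to_instruction(row, n_choices):
    """Turn a multiple-choice record into a lettered prompt and its answer."""
    options = [row[f"choice{i}"] for i in range(n_choices)]
    lettered = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    prompt = f"{row['inputs']}\nOptions:\n{lettered}\nAnswer:"
    # The target is the letter of the correct option (0 -> A, 1 -> B, ...).
    return {"inputs": prompt, "targets": chr(65 + row["labels"])}


record = {"inputs": "Which is a fruit?", "choice0": "apple", "choice1": "chair", "labels": 0}
example = to_instruction(record, n_choices=2)
print(example["inputs"])
print(example["targets"])
```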

tasksource-label-nli

We also recast all classification tasks as natural language inference, to improve entailment-based zero-shot classification: 🤗/zero-shot-label-nli
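The recasting idea can be sketched as follows: each candidate label becomes a hypothesis, and only the true label is marked as entailed. The function `recast_as_nli`, the hypothesis template, and the label strings are hypothetical and are not taken from the released dataset.

```python
# Hypothetical sketch of label-to-NLI recasting; template and labels are assumptions.
def recast_as_nli(text, label, label_names, template="This example is {}."):
    """Build one premise/hypothesis pair per candidate label."""
    rows = []
    for name in label_names:
        rows.append({
            "premise": text,
            "hypothesis": template.format(name),
            # Only the gold label yields an entailed hypothesis.
            "label": "entailment" if name == label else "not_entailment",
        })
    return rows


pairs = recast_as_nli("A great movie.", "positive", ["positive", "negative"])
for pair in pairs:
    print(pair)
```

At inference time, an entailment model can then score one hypothesis per unseen label and pick the highest-scoring one.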

Write and use custom preprocessings

from tasksource import MultipleChoice

codah = MultipleChoice('question_propmt',choices_list='candidate_answers',
    labels='correct_answer_idx',
    dataset_name='codah', config_name='codah')
    
winogrande = MultipleChoice('sentence',['option1','option2'],'answer',
    dataset_name='winogrande',config_name='winogrande_xl',
    splits=['train','validation',None]) # test labels are not usable
    
tasks = [winogrande.load(), codah.load()] # Aligned datasets (same columns) can be used interchangeably

Citation and contact

For more details, refer to this article:

@article{sileo2023tasksource,
  title={tasksource: Structured Dataset Preprocessing Annotations for Frictionless Extreme Multi-Task Learning and Evaluation},
  author={Sileo, Damien},
  url= {https://arxiv.org/abs/2301.05948},
  journal={arXiv preprint arXiv:2301.05948},
  year={2023}
}

For help integrating tasksource into your experiments, please contact damien.sileo@inria.fr.
