
CANNOT dataset

Compilation of ANnotated, Negation-Oriented Text-pairs



Introduction

CANNOT is a dataset that focuses on negated textual pairs. It currently contains 77,376 samples, of which roughly half are negated pairs of sentences, and the other half are not (they are paraphrased versions of each other).

The most frequent type of negation in the dataset is verbal negation (e.g., will → won't), although it also contains pairs with antonyms (e.g., cold → hot).


Format

The dataset is given as a .tsv file with the following structure:

premise        hypothesis                                           label
A sentence.    An equivalent, non-negated sentence (paraphrased).   0
A sentence.    The sentence negated.                                1

The dataset can be easily loaded into a Pandas DataFrame by running:

import pandas as pd

# The file is tab-separated, so the separator must be given explicitly.
dataset = pd.read_csv('negation_dataset_v1.0.tsv', sep='\t')
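
The label column can then be used to separate the negated pairs from the paraphrased ones:

# Split the pairs by label: 1 = negated, 0 = paraphrased (non-negated).
negated_pairs = dataset[dataset['label'] == 1]
paraphrased_pairs = dataset[dataset['label'] == 0]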

Construction

The dataset has been created by cleaning up and merging the following datasets:

  1. Not another Negation Benchmark: The NaN-NLI Test Suite for Sub-clausal Negation (see datasets/nan-nli).

  2. GLUE Diagnostic Dataset (see datasets/glue-diagnostic).

  3. Automated Fact-Checking of Claims from Wikipedia (see datasets/wikifactcheck-english).

  4. From Group to Individual Labels Using Deep Features (see datasets/sentiment-labelled-sentences). In this case, the negated sentences were obtained by using the Python module negate (see the sketch after this list).

  5. It Is Not Easy To Detect Paraphrases: Analysing Semantic Similarity With Antonyms and Negation Using the New SemAntoNeg Benchmark (see datasets/antonym-substitution).
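
As a rough illustration of the rule-based negation step in item 4, the negate module can be used along the following lines (a minimal sketch; the exact API and options may differ between versions):

from negate import Negator

# Load the negator; a spaCy model is downloaded on first use.
negator = Negator()

# Produce the negated counterpart of a sentence.
negated = negator.negate_sentence("An apple a day keeps the doctor away.")
print(negated)  # e.g., "An apple a day doesn't keep the doctor away."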


Once processed, the number of samples remaining from each of the datasets above is:

Dataset                                                Samples
Not another Negation Benchmark                             118
GLUE Diagnostic Dataset                                    154
Automated Fact-Checking of Claims from Wikipedia        14,970
From Group to Individual Labels Using Deep Features      2,110
It Is Not Easy To Detect Paraphrases                     8,597
Total                                                   25,949

Additionally, for each of the negated samples, another pair of non-negated sentences has been added by paraphrasing the original sentence with the pre-trained model 🤗tuner007/pegasus_paraphrase.
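
A minimal sketch of how such paraphrases can be generated with this model through 🤗 Transformers (the generation settings below are illustrative assumptions, not necessarily the ones used to build the dataset):

from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = 'tuner007/pegasus_paraphrase'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

def paraphrase(sentence):
    # Tokenize the input and generate a paraphrase with beam search.
    batch = tokenizer([sentence], truncation=True, padding='longest',
                      max_length=60, return_tensors='pt')
    outputs = model.generate(**batch, max_length=60, num_beams=10)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

print(paraphrase('The sky is clear today.'))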

Finally, the swapped version of each pair (premise ⇋ hypothesis) has also been included, and any duplicates have been removed.
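
In Pandas, this step could look roughly as follows (a hypothetical sketch of the procedure described above, not the actual build script):

import pandas as pd

# Swap premise and hypothesis in every pair.
swapped = dataset.rename(columns={'premise': 'hypothesis',
                                  'hypothesis': 'premise'})

# Append the swapped pairs and drop exact duplicates.
dataset = (pd.concat([dataset, swapped], ignore_index=True)
             .drop_duplicates()
             .reset_index(drop=True))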

With this, the number of premises/hypotheses in the CANNOT dataset that appear in the original datasets is:

Dataset                                                Sentences
Not another Negation Benchmark                             552   (0.36 %)
GLUE Diagnostic Dataset                                    586   (0.38 %)
Automated Fact-Checking of Claims from Wikipedia        89,728  (59.98 %)
From Group to Individual Labels Using Deep Features     12,626   (8.16 %)
It Is Not Easy To Detect Paraphrases                    17,198  (11.11 %)
Total                                                  120,690  (77.99 %)

The percentages above are relative to the total number of premises and hypotheses in the CANNOT dataset (2 × 77,376 = 154,752 sentences). The remaining 22.01 % (34,062 sentences) are the novel premises/hypotheses added through paraphrasing and rule-based negation.


Contributions

Questions? Bugs? Feel free to open a new issue.


Acknowledgments

We thank all the authors whose previous work made this dataset possible:

Thinh Hung Truong, Yulia Otmakhova, Timothy Baldwin, Trevor Cohn, Jey Han Lau, Karin Verspoor, Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman, Aalok Sathe, Salar Ather, Tuan Manh Le, Nathan Perry, Joonsuk Park, Dimitrios Kotzias, Misha Denil, Nando De Freitas, Padhraic Smyth, Teemu Vahtola, Mathias Creutz, and Jörg Tiedemann.


License

The CANNOT dataset is released under CC BY-SA 4.0.




Citation

@misc{anschütz2023correct,
      title={This is not correct! Negation-aware Evaluation of Language Generation Systems}, 
      author={Miriam Anschütz and Diego Miguel Lozano and Georg Groh},
      year={2023},
      eprint={2307.13989},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
