E2E refined dataset

This is a refined version of the E2E dataset.

Authors: Keisuke Toyama, Katsuhito Sudoh, Satoshi Nakamura

Description

The E2E dataset is a widely used dataset for MR-to-text generation. It consists of pairs of a British English text and a corresponding meaning representation (MR) in the restaurant recommendation domain. However, some of the MR-text pairs suffer from the following errors: "deletion" (an MR value is not reflected in the text), "insertion" (an MR whose value is empty appears in the text with an unintended value), and "substitution" (an MR value is replaced by a different value in the text). Since such errors degrade the quality of MR-to-text systems, they should be fixed as far as possible. We therefore developed a refined dataset, together with Python programs that convert the original E2E dataset into the refined dataset.
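
To make the three error types concrete, here is a minimal Python sketch. It is not the authors' correction code. It assumes a hypothetical `realized` dictionary that already records which value, if any, the text expresses for each attribute; extracting that from the text is out of scope here. The function name `classify_errors` and the sample values are illustrative only.

```python
from typing import Dict, List, Optional, Tuple


def classify_errors(
    mr: Dict[str, str], realized: Dict[str, Optional[str]]
) -> List[Tuple[str, str]]:
    """Compare an MR with the values realized in the text (hypothetical helper).

    mr:       attribute -> value in the MR ("" means the attribute is empty)
    realized: attribute -> value found in the text (None or "" if not mentioned)
    """
    errors = []
    for attr in sorted(set(mr) | set(realized)):
        mr_value = mr.get(attr) or ""
        text_value = realized.get(attr) or ""
        if mr_value and not text_value:
            errors.append((attr, "deletion"))      # MR value missing from the text
        elif not mr_value and text_value:
            errors.append((attr, "insertion"))     # text mentions an empty attribute
        elif mr_value != text_value:
            errors.append((attr, "substitution"))  # text uses a different value
    return errors


# Illustrative sample, not taken from the dataset
sample_mr = {"name": "The Eagle", "food": "French", "area": "", "priceRange": "cheap"}
sample_realized = {"name": "The Eagle", "food": "Italian", "area": "riverside", "priceRange": ""}
print(classify_errors(sample_mr, sample_realized))
# [('area', 'insertion'), ('food', 'substitution'), ('priceRange', 'deletion')]
```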

We also provide the following additional annotations:

- "MR order" (order): the order in which the MR values are mentioned in the corresponding text
- "Number of sentences" (num_sen): the number of sentences in the text part
- "Sentence indexes" (idx_sen): the indexes of the sentences that include the corresponding MR values

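As an illustration of what these three annotations describe, the sketch below derives them from an MR and a text using naive, case-insensitive string matching and a naive sentence split. Everything about it is an assumption made for illustration: the function name `annotate`, the 1-based numbering, the use of 0 for "not mentioned", and the dictionary layout may all differ from the released files and from the authors' programs.

```python
import re
from typing import Dict


def annotate(mr: Dict[str, str], text: str) -> Dict[str, object]:
    """Derive order, num_sen and idx_sen by naive string matching (illustrative)."""
    lowered = text.lower()
    # Naive sentence spans: a run of non-terminators plus trailing ., ! or ?
    spans = [m.span() for m in re.finditer(r"[^.!?]+[.!?]*\s*", text)]

    # Character position of the first mention of each non-empty MR value
    positions = {}
    for attr, value in mr.items():
        pos = lowered.find(value.lower()) if value else -1
        if pos >= 0:
            positions[attr] = pos

    order = {attr: 0 for attr in mr}    # 0 = not mentioned (assumed convention)
    idx_sen = {attr: 0 for attr in mr}  # 0 = not mentioned (assumed convention)
    for rank, attr in enumerate(sorted(positions, key=positions.get), start=1):
        order[attr] = rank
        for sen_no, (start, end) in enumerate(spans, start=1):
            if start <= positions[attr] < end:
                idx_sen[attr] = sen_no
                break
    return {"order": order, "num_sen": len(spans), "idx_sen": idx_sen}


# Illustrative sample, not taken from the dataset
example_mr = {"name": "The Eagle", "eatType": "coffee shop", "food": "French",
              "near": "Burger King", "area": ""}
example_text = "The Eagle is a coffee shop near Burger King. It serves French food."
print(annotate(example_mr, example_text))
```

In this sample, "near" is mentioned before "food", so it receives the earlier position in `order` even though "food" comes first in the MR, and "food" is the only value located in the second sentence.
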
Dataset

https://github.com/KSKTYM/E2E-refined-dataset/blob/main/release/e2e_refined_dataset_v1_0_0.zip

Python Programs

Development Environment

OS
- Ubuntu 20.04
Python
- 3.8.10

Usage

download the E2E dataset

$ ./EXE0-GET-E2E-DATASET.sh

convert csv files to json files

$ ./EXE1-CONV-CSV2JSON.sh

correct text data

$ ./EXE2-CORRECT-TXT.sh

correct MR data

$ ./EXE3-CORRECT-MR.sh

convert json files to csv files

$ ./EXE4-CONV-JSON2CSV.sh

collect the generated dataset files and pack them into a zip file

$ ./EXE5-MAKE-RELEASE-PACKAGE.sh

You can execute all of these steps with a single command:

$ ./EXE-ALL.sh
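
For orientation, the original E2E CSV files store each MR as a single string of `attribute[value]` slots separated by commas. The following minimal sketch shows one way to turn such a string into a dictionary suitable for JSON and back again. It is not the logic of `m_conv_csv2json.py` or `m_conv_json2csv.py`; the names `parse_mr` and `serialize_mr` are hypothetical, and the JSON layout produced by the actual scripts may differ.

```python
import json
import re
from typing import Dict

# One slot looks like "customer rating[5 out of 5]"
MR_SLOT = re.compile(r"\s*([^\[\],]+?)\s*\[([^\]]*)\]")


def parse_mr(mr_string: str) -> Dict[str, str]:
    """Parse an E2E-style MR string into an attribute -> value dictionary."""
    return {m.group(1): m.group(2) for m in MR_SLOT.finditer(mr_string)}


def serialize_mr(mr: Dict[str, str]) -> str:
    """Turn an attribute -> value dictionary back into an E2E-style MR string."""
    return ", ".join(f"{attr}[{value}]" for attr, value in mr.items())


mr_string = "name[The Eagle], eatType[coffee shop], customer rating[5 out of 5]"
mr_dict = parse_mr(mr_string)
print(json.dumps(mr_dict))
# {"name": "The Eagle", "eatType": "coffee shop", "customer rating": "5 out of 5"}
print(serialize_mr(mr_dict) == mr_string)
# True
```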

Error Analysis

$ cd error_analysis
$ ./EXE-e2e-dataset.sh
$ ./EXE-cleaned-dataset.sh
$ ./EXE-enriched-dataset.sh

Citing

If you use this dataset in your work, please cite the following papers:

@inproceedings{novikova2017e2e,
  title={The {E2E} Dataset: New Challenges for End-to-End Generation},
  author={Novikova, Jekaterina and Du{\v{s}}ek, Ondrej and Rieser, Verena},
  booktitle={Proceedings of the 18th Annual Meeting of the Special Interest 
             Group on Discourse and Dialogue},
  address={Saarbr\"ucken, Germany},
  year={2017},
  note={arXiv:1706.09254},
  url={https://arxiv.org/abs/1706.09254},
}

Version

2022/07/14 version 0.8.0
2022/09/28 version 0.9.0 (prerelease version)
2022/11/01 version 1.0.0 (initial version)

License

Distributed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).
