- If you are in IESL and using blake.cs.umass.edu, first run

  ```
  module load python3/3.9.1-2102
  ```

  without which Stanza may not work.
- Set up a Python virtual environment and download the Stanza models (so you no longer have to do the CoreNLP setup):

  ```
  python3 -m venv iddve
  source iddve/bin/activate
  python -m pip install -r requirements/mini_requirements.txt
  python -c "import stanza; stanza.download('en')"
  ```
- Run the code to create the datasets:

  ```
  python build_pair_datasets.py [--debug]
  ```

  Adding the `--debug` flag creates smaller datasets, better for viewing and testing. They are created in a folder whose name ends in `_debug`.
- Verify the built datasets. To compare the checksums of your generated files with the originals, run

  ```
  python check.py [--debug]
  ```

  The `--debug` flag checks the files created using the debug flag in step 2.
  If you don't get 'OK' for all the files, for now, ask Neha what to do about it.
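The checksum comparison can be sketched as follows. This is a minimal illustration using MD5 via `hashlib`; the actual hash function, reference checksums, and logic in `check.py` may differ, and the `EXPECTED` table below is hypothetical:

```python
import hashlib

def file_md5(path):
    """Compute the MD5 checksum of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(8192), b""):
            h.update(block)
    return h.hexdigest()

# Hypothetical reference checksums; the real ones live in check.py.
EXPECTED = {
    "review_rebuttal_pair_dataset/unstructured.json": "d41d8cd98f00b204e9800998ecf8427e",
}

def check_files(expected):
    """Print OK/MISMATCH for each file against its reference checksum."""
    for path, ref in expected.items():
        status = "OK" if file_md5(path) == ref else "MISMATCH"
        print(f"{path}: {status}")
```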
Output files are in JSON format. You should see this file structure:

```
iclr-discourse-dataset
│
└─── review_rebuttal_pair_dataset/
│    │   unstructured.json
│    │   traindev_train.json
│    │   traindev_dev.json
│    │   traindev_test.json
│    │   truetest.json
│
└─── review_rebuttal_pair_dataset_debug/   # if you ran with --debug as well
│    │   unstructured.json                 # these files will be much smaller
│    │   traindev_train.json
│    │   traindev_dev.json
│    │   traindev_test.json
│    │   truetest.json
│
│    ... other stuff ...
```
- `unstructured`: unstructured text from reviews, rebuttals, and abstracts in ICLR 2018, for use in domain pre-training à la Don't Stop Pretraining
- `truetest`: review-rebuttal pairs (20% of all pairs) from ICLR 2020, to be used as an unseen test set
- `traindev`: review-rebuttal pairs from ICLR 2019 in a traditional train/dev/test split (3:1:1)
Each file has the following fields:

- `conference`: which ICLR conference the examples are drawn from
- `split`: which split this data is from, out of unstructured/traindev/truetest
- `subsplit`: train, dev, or test
- `review_rebuttal_pairs`: a list of review-rebuttal pairs
Each review-rebuttal pair has the following fields:

- `index`: index within the dataset
- `review_sid`: 'super id' (id of the first comment) in the review
- `rebuttal_sid`: 'super id' (id of the first comment) in the rebuttal
- `review_text`: review text in chunks
- `rebuttal_text`: rebuttal text in chunks
- `title`: paper title
- `review_author`: id of the reviewer, e.g. "AnonReviewer1"
- `forum`: unique 'forum' id from the OpenReview API, which identifies the paper
- `labels`: categorical labels where available, e.g. review rating, reviewer confidence
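As a quick sanity check on this schema, a dataset file can be loaded with the standard `json` module and its fields inspected. This is a sketch: the field values below (conference name, ids, labels) are made-up placeholders standing in for an actual file such as `review_rebuttal_pair_dataset/traindev_train.json`:

```python
import json

# In-memory example mirroring the schema described above;
# in practice, load one of the generated JSON files instead.
raw = json.dumps({
    "conference": "iclr19",
    "split": "traindev",
    "subsplit": "train",
    "review_rebuttal_pairs": [
        {
            "index": 0,
            "review_sid": "sid_review_0",
            "rebuttal_sid": "sid_rebuttal_0",
            "review_text": [[["This", "is", "a", "review", "."]]],
            "rebuttal_text": [[["Thanks", "for", "the", "feedback", "."]]],
            "title": "An Example Paper",
            "review_author": "AnonReviewer1",
            "forum": "forum123",
            "labels": {"rating": 6},
        }
    ],
})

dataset = json.loads(raw)
for pair in dataset["review_rebuttal_pairs"]:
    print(pair["title"], pair["review_author"], pair["labels"])
```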
Text is represented as a list of lists of lists: the top-level lists represent chunks ('paragraphs' separated by newlines), each chunk is a list of sentences, and each sentence is a list of tokens. Sentence splitting and tokenization are carried out by the Stanza pipeline.
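For example, a `review_text` value with two chunks might look like the (made-up) value below, and can be flattened back to plain text with nested joins. This is a naive whitespace join, so punctuation ends up preceded by a space:

```python
# Hypothetical review_text: 2 chunks; the first has 2 sentences.
review_text = [
    [["The", "paper", "is", "clear", "."],
     ["Results", "are", "strong", "."]],
    [["However", ",", "ablations", "are", "missing", "."]],
]

def flatten(chunks):
    """Join tokens into sentences, sentences into chunks, chunks into text."""
    return "\n".join(
        " ".join(" ".join(sentence) for sentence in chunk)
        for chunk in chunks
    )

print(flatten(review_text))
```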