- If you are in IESL and using blake.cs.umass.edu, first run

  ```
  module load python3/3.9.1-2102
  ```

  without which Stanza may not work.
- Set up a Python virtual environment and download the Stanza models (so you no longer have to do the CoreNLP setup):

  ```
  python3 -m venv iddve
  source iddve/bin/activate
  python -m pip install -r requirements/mini_requirements.txt
  python -c "import stanza; stanza.download('en')"
  ```
- Run the code to create the datasets:

  ```
  python build_pair_datasets.py [--debug]
  ```

  Adding the `--debug` flag creates smaller datasets, better for viewing and testing. They are created in a folder whose name ends in `_debug`.
- Verify the built datasets. To compare the checksums of your generated files with the originals, run

  ```
  python check.py [--debug]
  ```

  The `--debug` flag checks the files created using the debug flag in step 2.
  If you don't get 'OK' for all the files, for now, ask Neha what to do about it.
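The checksum comparison can be sketched as follows. This is a minimal illustration using MD5 via `hashlib`; the actual hash function, reference checksums, and logic in `check.py` may differ, and the `EXPECTED` table below is hypothetical:

```python
import hashlib

def file_md5(path):
    """Compute the MD5 checksum of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(8192), b""):
            h.update(block)
    return h.hexdigest()

# Hypothetical reference checksums; the real ones live in check.py.
EXPECTED = {
    "review_rebuttal_pair_dataset/unstructured.json": "d41d8cd98f00b204e9800998ecf8427e",
}

def check_files(expected):
    """Print OK/MISMATCH for each file against its reference checksum."""
    for path, ref in expected.items():
        status = "OK" if file_md5(path) == ref else "MISMATCH"
        print(f"{path}: {status}")
```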
Output files are in JSON format. You should see this file structure:

```
iclr-discourse-dataset
│
└─── review_rebuttal_pair_dataset/
│    │   unstructured.json
│    │   traindev_train.json
│    │   traindev_dev.json
│    │   traindev_test.json
│    │   truetest.json
│
└─── review_rebuttal_pair_dataset_debug/   # if you ran with --debug as well
│    │   unstructured.json                 # these files will be much smaller
│    │   traindev_train.json
│    │   traindev_dev.json
│    │   traindev_test.json
│    │   truetest.json
│
│    ... other stuff ...
```
- `unstructured`: unstructured text from reviews, rebuttals, and abstracts in ICLR 2018, for use in domain pre-training à la Don't Stop Pretraining
- `truetest`: review-rebuttal pairs (20% of all pairs) from ICLR 2020, to be used as an unseen test set
- `traindev`: review-rebuttal pairs from ICLR 2019 in a traditional train/dev/test split (3:1:1)
Each file has the following fields:

- `conference`: which ICLR conference the examples are drawn from
- `split`: which split this data is from, out of unstructured/traindev/truetest
- `subsplit`: train, dev, or test
- `review_rebuttal_pairs`: a list of review-rebuttal pairs
Each review-rebuttal pair has the following fields:

- `index`: index within the dataset
- `review_sid`: 'super id' (id of the first comment) in the review
- `rebuttal_sid`: 'super id' (id of the first comment) in the rebuttal
- `review_text`: review text in chunks
- `rebuttal_text`: rebuttal text in chunks
- `title`: paper title
- `review_author`: id of the reviewer, e.g. "AnonReviewer1"
- `forum`: unique 'forum' id from the OpenReview API, which identifies the paper
- `labels`: categorical labels where available, e.g. review rating, reviewer confidence
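As a quick sanity check on this schema, a dataset file can be loaded with the standard `json` module and its fields inspected. This is a sketch: the field values below (conference name, ids, labels) are made-up placeholders standing in for an actual file such as `review_rebuttal_pair_dataset/traindev_train.json`:

```python
import json

# In-memory example mirroring the schema described above;
# in practice, load one of the generated JSON files instead.
raw = json.dumps({
    "conference": "iclr19",
    "split": "traindev",
    "subsplit": "train",
    "review_rebuttal_pairs": [
        {
            "index": 0,
            "review_sid": "sid_review_0",
            "rebuttal_sid": "sid_rebuttal_0",
            "review_text": [[["This", "is", "a", "review", "."]]],
            "rebuttal_text": [[["Thanks", "for", "the", "feedback", "."]]],
            "title": "An Example Paper",
            "review_author": "AnonReviewer1",
            "forum": "forum123",
            "labels": {"rating": 6},
        }
    ],
})

dataset = json.loads(raw)
for pair in dataset["review_rebuttal_pairs"]:
    print(pair["title"], pair["review_author"], pair["labels"])
```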
Text is represented as a list of lists of lists: the top-level lists represent chunks ('paragraphs' separated by newlines), each chunk is a list of sentences, and each sentence is a list of tokens. Sentence splitting and tokenization are carried out by the Stanza pipeline.
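For example, a `review_text` value with two chunks might look like the (made-up) value below, and can be flattened back to plain text with nested joins. This is a naive whitespace join, so punctuation ends up preceded by a space:

```python
# Hypothetical review_text: 2 chunks; the first has 2 sentences.
review_text = [
    [["The", "paper", "is", "clear", "."],
     ["Results", "are", "strong", "."]],
    [["However", ",", "ablations", "are", "missing", "."]],
]

def flatten(chunks):
    """Join tokens into sentences, sentences into chunks, chunks into text."""
    return "\n".join(
        " ".join(" ".join(sentence) for sentence in chunk)
        for chunk in chunks
    )

print(flatten(review_text))
```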