GitHub - amazon-science/efficient-longdoc-classification

Source code for "Efficient Classification of Long Documents Using Transformers"

For more details, please refer to our paper, and cite it if you find this repository useful:

@inproceedings{park-etal-2022-efficient,
    title = "Efficient Classification of Long Documents Using Transformers",
    author = "Park, Hyunji  and
      Vyas, Yogarshi  and
      Shah, Kashif",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-short.79",
    doi = "10.18653/v1/2022.acl-short.79",
    pages = "702--709",
}

Instructions

1. Install required libraries

pip install -r requirements.txt
python -m spacy download en_core_web_sm

2. Prepare the datasets

Hyperpartisan News Detection

Available at https://zenodo.org/record/1489920#.YLferh1Olc8
Download the datasets

mkdir -p data/hyperpartisan
wget -P data/hyperpartisan/ https://zenodo.org/record/1489920/files/articles-training-byarticle-20181122.zip
wget -P data/hyperpartisan/ https://zenodo.org/record/1489920/files/ground-truth-training-byarticle-20181122.zip
unzip data/hyperpartisan/articles-training-byarticle-20181122.zip -d data/hyperpartisan
unzip data/hyperpartisan/ground-truth-training-byarticle-20181122.zip -d data/hyperpartisan
rm data/hyperpartisan/*.zip

Convert the resulting XML files into the final dataset with this preprocessing script (following Longformer): https://github.com/allenai/longformer/blob/master/scripts/hp_preprocess.py
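
To give a feel for what the preprocessing step has to do, here is a minimal sketch that joins article texts with their labels by article id. It is not the linked hp_preprocess.py script, which remains the one to use. The inline XML samples, the `article` tag, and the `id` and `hyperpartisan` attributes are assumptions about the layout of the unzipped files, and all function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Tiny inline samples standing in for the two unzipped XML files.
# Tag and attribute names are assumptions about the real file layout.
ARTICLES_XML = """<articles>
  <article id="0000001" title="First"><p>Some text.</p><p>More text.</p></article>
  <article id="0000002" title="Second"><p>Other text.</p></article>
</articles>"""

LABELS_XML = """<articles>
  <article id="0000001" hyperpartisan="true"/>
  <article id="0000002" hyperpartisan="false"/>
</articles>"""


def read_labels(xml_text):
    """Map article id -> bool label from the ground-truth XML."""
    root = ET.fromstring(xml_text)
    return {a.get("id"): a.get("hyperpartisan") == "true" for a in root.iter("article")}


def read_articles(xml_text):
    """Map article id -> plain text, joining all text nodes inside each article."""
    root = ET.fromstring(xml_text)
    texts = {}
    for a in root.iter("article"):
        # itertext() walks nested tags such as <p>; split/join normalises whitespace.
        texts[a.get("id")] = " ".join(" ".join(a.itertext()).split())
    return texts


def build_examples(articles_xml, labels_xml):
    """Join texts and labels on article id into a list of records."""
    labels = read_labels(labels_xml)
    texts = read_articles(articles_xml)
    return [
        {"id": i, "text": t, "label": labels[i]}
        for i, t in texts.items()
        if i in labels
    ]


examples = build_examples(ARTICLES_XML, LABELS_XML)
```

For the real files, the same functions could be fed the contents of the two XML files read from data/hyperpartisan instead of the inline samples.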

20NewsGroups

Originally available at http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz
Running train.py with the --data 20news flag downloads and prepares the data available via sklearn.datasets (following CogLTX). We adopt the train/dev/test split from the ToBERT paper.
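
As a rough illustration of loading this dataset through scikit-learn, the sketch below fetches the standard train and test subsets and holds out part of the training data as a dev set. The 10% dev fraction, the seed, and the function names are illustrative assumptions; they do not reproduce the ToBERT split that train.py uses.

```python
import random


def split_train_dev(items, dev_fraction=0.1, seed=0):
    """Shuffle indices deterministically and hold out a dev portion.

    dev_fraction and seed are illustrative assumptions, not the split
    used in the paper.
    """
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_dev = int(len(items) * dev_fraction)
    dev_idx = set(indices[:n_dev])
    train = [x for i, x in enumerate(items) if i not in dev_idx]
    dev = [x for i, x in enumerate(items) if i in dev_idx]
    return train, dev


def load_20news():
    """Fetch 20 Newsgroups through scikit-learn (downloads on first call)."""
    # Imported lazily so the split helper works without scikit-learn installed.
    from sklearn.datasets import fetch_20newsgroups

    train = fetch_20newsgroups(subset="train")
    test = fetch_20newsgroups(subset="test")
    train_pairs = list(zip(train.data, train.target))
    test_pairs = list(zip(test.data, test.target))
    train_pairs, dev_pairs = split_train_dev(train_pairs)
    return train_pairs, dev_pairs, test_pairs
```

Calling `load_20news()` needs network access the first time; `split_train_dev` can be tried on any list without downloading anything.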

EURLEX-57K

Available at https://github.com/iliaschalkidis/lmtc-emnlp2020
Download the datasets

mkdir -p data/EURLEX57K
wget -O data/EURLEX57K/datasets.zip http://nlp.cs.aueb.gr/software_and_datasets/EURLEX57K/datasets.zip
unzip data/EURLEX57K/datasets.zip -d data/EURLEX57K
rm data/EURLEX57K/datasets.zip
rm -rf data/EURLEX57K/__MACOSX
mv data/EURLEX57K/dataset/* data/EURLEX57K
rm -rf data/EURLEX57K/dataset
wget -O data/EURLEX57K/EURLEX57K.json http://nlp.cs.aueb.gr/software_and_datasets/EURLEX57K/eurovoc_en.json

Running train.py with the --data eurlex flag reads and prepares the data from the data/EURLEX57K/{train, dev, test}/*.json files.
Running train.py with the --data eurlex --inverted flags creates the Inverted EURLEX data by inverting the order of the sections.
data/EURLEX57K/EURLEX57K.json contains the label information.
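
The section inversion can be pictured with a small sketch like the one below. The section names (`header`, `recitals`, `main_body`, `attachments`), the `concepts` label field, and the function names are assumptions about the JSON schema and are not taken from train.py, so treat this as an illustration of the idea rather than the actual preprocessing.

```python
import json

# Assumed section order in each EURLEX57K JSON document.
SECTIONS = ("header", "recitals", "main_body", "attachments")


def document_text(doc, inverted=False):
    """Concatenate the sections of one document, optionally in reverse order."""
    parts = []
    for name in SECTIONS:
        value = doc.get(name) or ""
        # main_body is assumed to be a list of strings; join it into one section.
        if isinstance(value, list):
            value = "\n".join(value)
        if value:
            parts.append(value)
    if inverted:
        # Reverse the order of sections, not the text inside each section.
        parts.reverse()
    return "\n".join(parts)


def load_document(path, inverted=False):
    """Read one JSON file and return (text, labels)."""
    with open(path, encoding="utf-8") as f:
        doc = json.load(f)
    return document_text(doc, inverted=inverted), doc.get("concepts", [])
```

With inversion, the content that normally opens a document ends up at the end, which is what makes the inverted variant harder for models that truncate the input.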

CMU Book Summary Dataset

Available at http://www.cs.cmu.edu/~dbamman/booksummaries.html

wget -P data/ http://www.cs.cmu.edu/~dbamman/data/booksummaries.tar.gz
tar -xf data/booksummaries.tar.gz -C data

Running train.py with the --data books flag reads and prepares the data from data/booksummaries/booksummaries.txt.
Running train.py with the --data books --pairs flags creates the Paired Book Summary data by combining pairs of summaries and their labels.
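
One way to picture the pairing step is the sketch below, which parses tab-separated records and merges two summaries together with the union of their genre labels. The column order, the JSON-encoded genre field, the choice to pair consecutive records, and the function names are all assumptions made for illustration; train.py may select and combine pairs differently.

```python
import json


def parse_line(line):
    """Parse one tab-separated record; the column order is an assumption."""
    fields = line.rstrip("\n").split("\t")
    _wiki_id, _freebase_id, title, _author, _date, genres_json, summary = fields
    # Genres are assumed to be a JSON object mapping ids to genre names.
    genres = sorted(json.loads(genres_json).values()) if genres_json else []
    return {"title": title, "genres": genres, "summary": summary}


def make_pairs(records):
    """Combine consecutive records (0 with 1, 2 with 3, ...).

    A trailing unpaired record is dropped. Pairing neighbours is an
    illustrative choice, not necessarily the strategy used in train.py.
    """
    pairs = []
    for first, second in zip(records[0::2], records[1::2]):
        pairs.append({
            "summary": first["summary"] + " " + second["summary"],
            "genres": sorted(set(first["genres"]) | set(second["genres"])),
        })
    return pairs
```

Because the labels of a pair are the union of both label sets, each paired example stays a multi-label instance while its text is roughly twice as long.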

3. Run the models

For example:

python train.py --model_name bertplusrandom --data books --pairs --batch_size 8 --epochs 20 --lr 3e-05

Note: for the CogLTX model, we use the original source code: https://github.com/Sleepychord/CogLTX
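
When launching several runs, it can help to assemble the command programmatically. The sketch below is one possible helper; `build_command` is a hypothetical name, and it only uses flags that appear in this README (--model_name, --data, --pairs, --inverted, --batch_size, --epochs, --lr).

```python
import shlex


def build_command(model_name, data, batch_size, epochs, lr, pairs=False, inverted=False):
    """Return the argument list for one train.py run."""
    args = ["python", "train.py", "--model_name", model_name, "--data", data]
    if pairs:
        args.append("--pairs")
    if inverted:
        args.append("--inverted")
    # Values are converted to strings so the list can be passed to subprocess.run.
    args += ["--batch_size", str(batch_size), "--epochs", str(epochs), "--lr", str(lr)]
    return args


cmd = build_command("bertplusrandom", "books", 8, 20, "3e-05", pairs=True)
print(shlex.join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` from the repository's source directory to start the run.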

Hyperparameters used

Hyperpartisan

Parameter	BERT	BERT+TextRank	BERT+Random	Longformer	ToBERT
Batch size	8	8	8	16	8
Epochs	20	20	20	20	20
LR	3e-05	3e-05	5e-05	5e-05	5e-05
Scheduler	NA	NA	NA	warmup	NA

20NewsGroups, Book Summary, Paired Book Summary

Parameter	BERT	BERT+TextRank	BERT+Random	Longformer	ToBERT
Batch size	8	8	8	16	8
Epochs	20	20	20	20	20
LR	3e-05	3e-05	3e-05	0.005	3e-05
Scheduler	NA	NA	NA	warmup	NA

EURLEX, Inverted EURLEX

Parameter	BERT	BERT+TextRank	BERT+Random	Longformer	ToBERT
Batch size	8	8	8	16	8
Epochs	20	20	20	20	20
LR	5e-05	5e-05	5e-05	0.005	5e-05
Scheduler	NA	NA	NA	warmup	NA
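
For scripting, the three tables above can be encoded as a small lookup. This is only a convenience sketch: the dictionary keys reuse the table headings, `hyperparameters` is a hypothetical helper, and the column names are display names from the tables, not necessarily valid values for --model_name.

```python
# Column order shared by the three tables above.
MODELS = ("BERT", "BERT+TextRank", "BERT+Random", "Longformer", "ToBERT")

# Learning rates copied from the tables, one tuple per dataset group.
LEARNING_RATES = {
    "Hyperpartisan": ("3e-05", "3e-05", "5e-05", "5e-05", "5e-05"),
    "20NewsGroups, Book Summary, Paired Book Summary": ("3e-05", "3e-05", "3e-05", "0.005", "3e-05"),
    "EURLEX, Inverted EURLEX": ("5e-05", "5e-05", "5e-05", "0.005", "5e-05"),
}


def hyperparameters(group, model):
    """Look up the settings for one dataset group and one model column."""
    idx = MODELS.index(model)
    return {
        # In all three tables only Longformer uses batch size 16 and a warmup scheduler.
        "batch_size": 16 if model == "Longformer" else 8,
        "epochs": 20,
        "lr": float(LEARNING_RATES[group][idx]),
        "scheduler": "warmup" if model == "Longformer" else None,
    }
```

For instance, `hyperparameters("Hyperpartisan", "Longformer")` returns batch size 16, 20 epochs, a learning rate of 5e-05, and the warmup scheduler.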
