
# Hierarchical Attention Transformers (HATs)

Implementation of Hierarchical Attention Transformers (HATs), presented in ["An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification"](https://arxiv.org/abs/2210.05529) by Chalkidis et al. (2022). HATs use a hierarchical attention scheme that combines segment-wise and cross-segment attention operations. You can think of segments as paragraphs or sentences.

## Citation

If you use HAT in your research, please cite:

An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification. Ilias Chalkidis, Xiang Dai, Manos Fergadiotis, Prodromos Malakasiotis, and Desmond Elliott. 2022. arXiv:2210.05529 (Preprint).

```bibtex
@misc{chalkidis-etal-2022-hat,
  url = {https://arxiv.org/abs/2210.05529},
  author = {Chalkidis, Ilias and Dai, Xiang and Fergadiotis, Manos and Malakasiotis, Prodromos and Elliott, Desmond},
  title = {An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification},
  publisher = {arXiv},
  year = {2022},
}
```

## Implementation Details

The repository supports several variants of the HAT architecture. The implementation of HAT is built on top of HuggingFace Transformers in PyTorch and is available at `models/hat/modelling_hat.py`. The layout of stacked segment-wise (SW) and cross-segment (CS) encoders is specified in the configuration file via the `encoder_layout` parameter (a short sketch after the list below shows how the effective transformer blocks of a layout are counted):

- **Ad-Hoc (AH):** An ad-hoc (partially pre-trained) HAT comprises an initial stack of $L_{\mathrm{SWE}}$ shared segment-wise encoders from a pre-trained transformer-based model, followed by $L_{\mathrm{CSE}}$ ad-hoc cross-segment encoders. In this case, the model first encodes and contextualizes token representations per segment, and then builds higher-order segment-level representations, e.g., a model with 8 segment-wise and 4 cross-segment encoders has 12 effective transformer blocks (Layout: S/S/S/S/S/S/S/S/D/D/D/D):
"encoder_layout": {
"0": {"sentence_encoder": true, "document_encoder":  false},
"1": {"sentence_encoder": true, "document_encoder":  false},
"2": {"sentence_encoder": true, "document_encoder":  false},
"3": {"sentence_encoder": true, "document_encoder":  false},
"4": {"sentence_encoder": true, "document_encoder":  false},
"5": {"sentence_encoder": true, "document_encoder":  false},
"6": {"sentence_encoder": true, "document_encoder":  false},
"7": {"sentence_encoder": true, "document_encoder":  false},
"8": {"sentence_encoder": false, "document_encoder":  true},
"9": {"sentence_encoder": false, "document_encoder":  true},
"10": {"sentence_encoder": false, "document_encoder":  true},
"11": {"sentence_encoder": false, "document_encoder":  true}
}
- **Interleaved (I):** An interleaved HAT comprises a stack of $L_{\mathrm{P}}$ paired segment-wise and cross-segment encoders, e.g., a 6-layer model has 12 effective transformer blocks (Layout: SD/SD/SD/SD/SD/SD):
"encoder_layout": {
"0": {"sentence_encoder": true, "document_encoder":  true},
"1": {"sentence_encoder": true, "document_encoder":  true},
"2": {"sentence_encoder": true, "document_encoder":  true},
"3": {"sentence_encoder": true, "document_encoder":  true},
"4": {"sentence_encoder": true, "document_encoder":  true},
"5": {"sentence_encoder": true, "document_encoder":  true}
}
- **Early-Contextualization (EC):** An early-contextualized HAT comprises an initial stack of $L_{\mathrm{P}}$ paired segment-wise and cross-segment encoders, followed by a stack of $L_{\mathrm{SWE}}$ segment-wise encoders. In this case, cross-segment attention (contextualization) is performed only in the initial layers of the model, e.g., a 6-layer model has 8 effective transformer blocks (Layout: SD/SD/S/S/S/S):
"encoder_layout": {
"0": {"sentence_encoder": true, "document_encoder":  true},
"1": {"sentence_encoder": true, "document_encoder":  true},
"2": {"sentence_encoder": true, "document_encoder":  false},
"3": {"sentence_encoder": true, "document_encoder":  false},
"4": {"sentence_encoder": true, "document_encoder":  false},
"5": {"sentence_encoder": true, "document_encoder":  false}
}
- **Late-Contextualization (LC):** A late-contextualized HAT comprises an initial stack of $L_{\mathrm{SWE}}$ segment-wise encoders, followed by a stack of $L_{\mathrm{P}}$ paired segment-wise and cross-segment encoders. In this case, cross-segment attention (contextualization) is performed only in the later layers of the model, e.g., a 6-layer model has 8 effective transformer blocks (Layout: S/S/S/S/SD/SD):
"encoder_layout": {
"0": {"sentence_encoder": true, "document_encoder":  false},
"1": {"sentence_encoder": true, "document_encoder":  false},
"2": {"sentence_encoder": true, "document_encoder":  false},
"3": {"sentence_encoder": true, "document_encoder":  false},
"4": {"sentence_encoder": true, "document_encoder":  true},
"5": {"sentence_encoder": true, "document_encoder":  true}
}
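
The number of effective transformer blocks quoted in each variant above is simply the count of enabled encoder flags in the layout. A minimal sketch in plain Python (not repository code; the AH layout is rebuilt here from the example above):

```python
# Count "effective transformer blocks" in an encoder_layout dict:
# every enabled sentence_encoder or document_encoder flag adds one block.
def effective_blocks(encoder_layout: dict) -> int:
    return sum(
        int(layer["sentence_encoder"]) + int(layer["document_encoder"])
        for layer in encoder_layout.values()
    )

# The Ad-Hoc (AH) layout from above: 8 segment-wise then 4 cross-segment layers.
ah_layout = {str(i): {"sentence_encoder": i < 8, "document_encoder": i >= 8}
             for i in range(12)}
assert effective_blocks(ah_layout) == 12
```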

In this study, we examine the efficacy of 8 alternative layouts, written in a pipe-delimited shorthand (a helper that expands this shorthand into the `encoder_layout` format is sketched after the block below):

```python
{
    'I1': 'SD|SD|SD|SD|SD|SD',
    'I2': 'S|SD|D|S|SD|D|S|SD|D',
    'I3': 'S|SD|S|SD|S|SD|S|SD',
    'LC1': 'S|S|S|S|S|S|SD|SD|SD',
    'LC2': 'S|S|S|S|S|SD|D|S|SD|D',
    'EC1': 'S|S|SD|D|S|SD|D|S|S|S',
    'EC2': 'S|S|SD|SD|SD|S|S|S|S',
    'AH':  'S|S|S|S|S|S|S|S|S|S|S|S',
}
```
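
Each pipe-delimited part of these shorthands corresponds to one layer: `S` enables the sentence (segment-wise) encoder and `D` the document (cross-segment) encoder. A hypothetical helper, not part of the repository, that expands a shorthand into the `encoder_layout` format shown earlier:

```python
# Hypothetical helper: expand a shorthand layout string such as "S|SD|D"
# into the encoder_layout dictionary used in the configuration files.
def layout_from_shorthand(shorthand: str) -> dict:
    return {
        str(i): {
            "sentence_encoder": "S" in part,
            "document_encoder": "D" in part,
        }
        for i, part in enumerate(shorthand.split("|"))
    }

# The I1 layout expands to six paired (SD) layers:
print(layout_from_shorthand("SD|SD|SD|SD|SD|SD"))
```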

## Available Models on HuggingFace Hub

| Model Name | Layers | Hidden Units | Attention Heads | Vocab | Parameters |
|---|---|---|---|---|---|
| kiddothe2b/hierarchical-transformer-base-4096 | 16 | 768 | 12 | 50K | 152M |
| kiddothe2b/longformer-base-4096 | 12 | 768 | 12 | 50K | 152M |
| kiddothe2b/adhoc-hierarchical-transformer-base-4096 | 16 | 768 | 12 | 50K | 140M |
| kiddothe2b/adhoc-hierarchical-transformer-I1-mini-1024 | 12 | 256 | 4 | 32K | 18M |
| kiddothe2b/adhoc-hierarchical-transformer-I3-mini-1024 | 12 | 256 | 4 | 32K | 18M |
| kiddothe2b/adhoc-hierarchical-transformer-LC1-mini-1024 | 12 | 256 | 4 | 32K | 18M |
| kiddothe2b/adhoc-hierarchical-transformer-EC2-mini-1024 | 12 | 256 | 4 | 32K | 18M |
| kiddothe2b/longformer-mini-1024 | 6 | 256 | 4 | 32K | 14M |
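
Since HAT is a custom architecture that is not part of the core Transformers library, loading these checkpoints presumably requires `trust_remote_code=True`. A minimal sketch, assuming the checkpoints register with the standard Auto classes:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL = "kiddothe2b/hierarchical-transformer-base-4096"

# trust_remote_code=True is assumed to be required, so that the custom
# HAT modelling code shipped with the checkpoint is downloaded and used.
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(MODEL, trust_remote_code=True)
```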

## Requirements

Make sure that all required packages are installed:

```
torch>=1.11.0
transformers>=4.18.0
datasets>=2.0.0
tokenizers>=0.11.0
scikit-learn>=1.0.0
tqdm>=4.62.0
nltk>=3.7.0
```

## How to run experiments?

You can use the shell scripts provided in the `running_scripts` directory to pre-train new models or fine-tune the released ones.

Try it on Google Colab: https://colab.research.google.com/drive/15feh49wqBshgkcvbO6QypvJoa3dG6P5S?usp=sharing

## I still have open questions...

Please post your questions in the Discussions section, or contact the corresponding author via e-mail.