This repository contains the code for the paper "Universal Dependencies according to BERT: both more specific and more general".
Our modification of the Universal Dependencies annotation is applied with UDApi. To install UDApi, follow the instructions in the UDApi repository. We have created a custom block that performs CoNLL-U modifications; to use it:
- Clone the UDApi repository
- Copy the file `attentionconvert.py` to `udapi-python/udapi/block/ud`
- Follow the steps in "Install Udapi for developers"
- Run in a command line:

```
udapy read.Conllu files=<path-to-conllu> ud.AttentionConvert write.Conllu > <path-to-converted-conllu>
```
Note that this step is optional in general; however, it is necessary to reproduce our results.
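For orientation, a UDApi block is a Python class that overrides `process_node` (or a similar hook) and edits nodes in place. The sketch below only illustrates the Block API with an assumed relabeling rule; it is not the actual conversion implemented in `attentionconvert.py`:

```python
# Illustrative only: a minimal UDApi block with an assumed rule
# (collapse relation subtypes, e.g. 'acl:relcl' -> 'acl').
# The real conversion rules live in attentionconvert.py.
from udapi.core.block import Block


class AttentionConvert(Block):
    """Modify UD annotation before attention-based evaluation."""

    def process_node(self, node):
        # Assumed example rule, not the project's actual logic:
        if ':' in node.deprel:
            node.deprel = node.deprel.split(':', 1)[0]
```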
The code and instructions for running BERT over text and extracting the resulting attention maps were created by Kevin Clark and adapted for this project. The original code is available at Attention Analysis (Clark et al.).
The input data should be a JSON file containing a list of dicts, each one corresponding to a single example to be passed into BERT. Each dict must contain exactly one of the following fields:
"text": A string."words": A list of strings. Needed if you want word-level rather than token-level attention."tokens": A list of strings corresponding to BERT wordpiece tokenization.
If the provided field is `"tokens"`, the script expects the [CLS]/[SEP] tokens to be already added; otherwise, it adds these tokens to the beginning/end of the text automatically.
Note that if an example is longer than `max_sequence_length` tokens after BERT wordpiece tokenization, attention maps will not be extracted for it.
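For illustration, a minimal input file could look like this (made-up sentences; each dict uses exactly one of the three fields):

```json
[
  {"text": "The cat sat on the mat."},
  {"words": ["The", "cat", "sat", "on", "the", "mat", "."]},
  {"tokens": ["[CLS]", "the", "cat", "sat", "on", "the", "mat", ".", "[SEP]"]}
]
```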
Attention extraction is run with:

```
python attention-analysis-clark-etal/extract_attention.py --preprocessed-data-file <path-to-your-data> --bert_dir <directory-containing-BERT-model> --max-sequence-length 256
```
The following optional arguments can also be added:
- `--max_sequence_length`: Maximum input sequence length after tokenization (default is 128).
- `--batch_size`: Batch size when running BERT over examples (default is 16).
- `--debug`: Use a tiny BERT model for fast debugging.
- `--cased`: Do not lowercase the input text.
- `--word_level`: Compute word-level instead of token-level attention (see Section 4.1 of the paper).
The list of attention matrices will be saved to `<path-to-your-data>_attentions.npz`. This file will be referred to as `<path-to-attentions>` in the next steps.
Wordpiece-tokenized sentences will be saved to `<path-to-your-data>_source.txt`. This file will be referred to as `<path-to-wordpieces>` in the next steps.
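A quick way to sanity-check the extracted file is to open it with NumPy; the exact archive layout is defined by `extract_attention.py`, so the keys and shapes below are assumptions:

```python
import numpy as np

# Replace with your actual <path-to-your-data>_attentions.npz
data = np.load("data_attentions.npz", allow_pickle=True)
print(data.files[:3])   # archive keys (one entry per example, assumed)
first = data[data.files[0]]
print(first.shape)      # e.g. (layers, heads, seq_len, seq_len), assumed
```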
Select syntactic head ensembles for each Universal Dependencies syntactic relation:
```
python3 head-ensembles/head_ensemble.py <path-to-attentions> <path-to-wordpieces> <path-to-conllu> -j <path-to-head-ensembles>
```
`<path-to-attentions>` and `<path-to-wordpieces>` were generated in the previous step.
`<path-to-conllu>` is the path to the CoNLL-U file used for evaluation, optionally converted with UDApi beforehand.
A dictionary is produced with syntactic labels as keys and head ensembles as values. Each head ensemble contains the following fields:
- `ensemble`: list of pairs `[layer_index, head_index]` of the heads selected for the ensemble
- `max_metric`: metric result for the head ensemble on the evaluation CoNLL-U file (dependency accuracy by default)
- `metric_history`: metric result at each step of the selection process
- `max_ensemble_size`: the limit on the number of heads in an ensemble
- `relation_label`: the same as the dictionary key
If the argument `-j`/`--json` is provided (as in the command above), the dictionary is saved in JSON format to the given path.
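For illustration, a saved head-ensemble file could look like the following (the relation label is a real UD label; the indices and scores are made up):

```json
{
  "nsubj": {
    "ensemble": [[3, 9], [7, 10], [4, 3]],
    "max_metric": 0.71,
    "metric_history": [0.58, 0.67, 0.71],
    "max_ensemble_size": 4,
    "relation_label": "nsubj"
  }
}
```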
Other arguments for the script:
- `--metric`: metric to optimize in head ensemble selection (currently only DepAcc is supported)
- `--num-heads`: the maximal size of each head ensemble (4 by default)
- `--sentences`: indices of the sentences used for selection
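The `metric_history` field and the `--num-heads` limit suggest a stepwise procedure; the sketch below is a conceptual illustration of such a greedy selection, assuming heads are added one at a time while the metric improves. It is not taken from `head_ensemble.py`; `select_ensemble` and `metric` are hypothetical names.

```python
def select_ensemble(heads, metric, max_size=4):
    """heads: iterable of (layer, head) pairs; metric: ensemble -> float."""
    ensemble, history = [], []
    for _ in range(max_size):
        candidates = [h for h in heads if h not in ensemble]
        if not candidates:
            break
        # Pick the head whose addition maximizes the ensemble metric.
        best = max(candidates, key=lambda h: metric(ensemble + [h]))
        score = metric(ensemble + [best])
        if history and score <= history[-1]:
            break  # a new head must improve the metric to be kept
        ensemble.append(best)
        history.append(score)
    return ensemble, history
```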
Construct dependency trees from the head ensembles selected in the previous step and evaluate their UAS and LAS on a CoNLL-U file.
```
python head-ensembles/extract_trees.py <path-to-attentions> <path-to-wordpieces> <path-to-conllu> <path-to-head-ensembles>
```
The results are printed to standard output. We use different CoNLL-U files for head ensemble selection (EuroParl with the UD modification) and for dependency tree evaluation (PUD without UD modifications).
Other arguments for the script:
- `--sentences`: indices of the sentences used for selection
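As a reference for the reported numbers, UAS and LAS have their standard definitions: UAS is the fraction of tokens whose predicted head is correct, and LAS additionally requires the correct relation label. A minimal sketch (not taken from `extract_trees.py`; `uas_las` is a hypothetical helper):

```python
def uas_las(gold, predicted):
    """gold, predicted: lists of (head_index, deprel) per token."""
    assert len(gold) == len(predicted)
    correct_head = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    correct_both = sum(g == p for g, p in zip(gold, predicted))
    n = len(gold)
    return correct_head / n, correct_both / n

# Example: 3 of 4 heads correct, 2 of 4 head+label pairs correct.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (3, "punct")]
print(uas_las(gold, pred))  # (0.75, 0.5)
```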
- Install the required packages with pip. Follow the instructions in Universal Dependencies Modification to install UDApi with our custom block.
- Download CoNLL-U files from the Universal Dependencies website. For instance, for Japanese, use the GSD train treebank `ja_gsd-ud-train.conllu` for head selection and `ja_pud-ud-test.conllu` for evaluation. Save the files to the `resources` directory.
- Download a BERT model from the BERT GitHub to `<directory-containing-BERT-model>`. Then extract the attention matrices by running the bash script:

  ```
  source scripts/extract_attention.sh ja_gsd-ud-train ja_pud-ud-test <directory-containing-BERT-model>
  ```

- Run head selection and tree extraction by running the bash script. Results will be saved in the `results` directory:

  ```
  source scripts/pipeline_eval.sh ja_gsd-ud-train ja_pud-ud-test
  ```
```
@misc{limisiewicz2020universal,
      title={Universal Dependencies according to BERT: both more specific and more general},
      author={Tomasz Limisiewicz and Rudolf Rosa and David Mare\v{c}ek},
      year={2020},
      eprint={2004.14620},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```