
Cross-Lingual and Multilingual Word Alignment


This project provides an API to perform word alignment.
The set of supported languages depends on the transformer model used.

How to use

See main.py for a runnable example:

sentence1 = "Today I went to the supermarket to buy apples".split()
sentence2 = "Oggi io sono andato al supermercato a comprare le mele".split()
BERT_NAME = "bert-base-multilingual-cased"
wa = WordAlignment(model_name=BERT_NAME, tokenizer_name=BERT_NAME, device='cpu', fp16=False)
_, decoded = wa.get_alignment(sentence1, sentence2, calculate_decode=True)
for (sentence1_w, sentence2_w) in decoded:
    print(sentence1_w, "\t--->", sentence2_w)

Output:

Today           ---> Oggi
I               ---> io
went            ---> andato
to              ---> al
the             ---> al
supermarket     ---> supermercato
to              ---> a
buy             ---> comprare
apples          ---> mele

get_alignment API

The signature of the function is List[str], List[str], bool -> Tuple[List[int], List[List[str]]].
To speed up the computation, you can skip the decoding step by setting calculate_decode to False.
If calculate_decode is False, the second returned value will be None.
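To make the decoded output concrete, here is a minimal, self-contained sketch of how alignment pairs can be decoded from a word-to-word similarity matrix by taking the best-scoring target word for each source word. This is only an illustration of the idea, not the library's internal algorithm; the scores below are made up, whereas the real model derives them from BERT embeddings.

```python
# Illustrative sketch (not the library's internals): decode word-alignment
# pairs from a toy similarity matrix by argmax over target words.

def decode_alignment(source, target, scores):
    """For each source word, pick the target word with the highest score.

    Returns (indices, decoded): the chosen target index per source word,
    and the corresponding (source_word, target_word) pairs.
    """
    indices = [max(range(len(target)), key=lambda j: row[j]) for row in scores]
    decoded = [(source[i], target[j]) for i, j in enumerate(indices)]
    return indices, decoded

source = "I went".split()
target = "io sono andato".split()
scores = [  # rows: source words, columns: target words (toy values)
    [0.9, 0.1, 0.2],  # "I"    -> best match "io"
    [0.2, 0.4, 0.8],  # "went" -> best match "andato"
]

indices, decoded = decode_alignment(source, target, scores)
for src_w, tgt_w in decoded:
    print(src_w, "\t--->", tgt_w)
```

With calculate_decode=True the library returns pairs in this spirit; with False, only the index-level result is computed.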

FP16 Support

WordAlignment supports FP16, but we discourage its use.

How to install

Word Alignment is fully compatible with NVIDIA CUDA.
To use CUDA, you must install the CUDA version of the Torch-Scatter library; the following script automates the installation:

bash cuda_install_requirements.sh

N.B.: The CUDA installation of Torch-Scatter takes several minutes to compile.

Dependencies

  • Python3
  • Torch
  • Transformers
  • Torch-Scatter
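For a CPU-only setup, the dependencies can typically be installed with pip; the package names below are the common distribution names and are an assumption, since the repository's own requirements file may differ:

```shell
# CPU-only install (assumed package names); for CUDA builds of
# Torch-Scatter, use cuda_install_requirements.sh as described above.
pip install torch transformers torch-scatter
```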

Authors