Bilingual Lexicon Induction via Unsupervised Bitext Construction and Word Alignment

Haoyue Shi, Luke Zettlemoyer, and Sida I. Wang

Bilingual lexicons map words in one language to their translations in another, and are typically induced by learning linear projections to align monolingual word embedding spaces. In this paper, we show it is possible to produce much higher quality lexicons with methods that combine (1) unsupervised bitext mining and (2) unsupervised word alignment.

Requirements

PyTorch >= 1.7
transformers == 4.0.0
fairseq (to run CRISS and extract CRISS-based features)
chinese_converter (to convert between simplified and traditional Chinese, fitting the different settings of CRISS and MUSE)

See also env/env.yml for a sufficient environment setup.

A Quick Example for the Pipeline of Lexicon Induction

Step 0: Download CRISS

The default setting assumes that the CRISS (3rd iteration) model is saved in criss/criss-3rd.pt.

Step 1: Unsupervised Bitext Construction with CRISS

Let's assume that we have the following bitext (sentences separated by " ||| ", one pair per line):

Das ist eine Katze . ||| This is a cat .
Das ist ein Hund . ||| This is a dog .
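
This repository constructs such bitext with CRISS-based mining. As a rough illustration of the idea (not this repo's exact code), the sketch below retrieves nearest neighbors with FAISS and keeps pairs under a simplified, forward-only margin criterion; embed() is a hypothetical stand-in for, e.g., mean-pooled CRISS encoder states, and the threshold is illustrative:

import faiss  # pip install faiss-cpu

def mine_bitext(src_sents, trg_sents, embed, k=4, threshold=1.06):
    # embed() is assumed to return unit-normalized float32 vectors, shape (n, d)
    x, y = embed(src_sents), embed(trg_sents)
    index = faiss.IndexFlatIP(y.shape[1])  # inner product = cosine on unit vectors
    index.add(y)
    sims, nbrs = index.search(x, k)        # top-k cosine similarities per source
    # Ratio margin: best similarity over the mean of the top-k neighbors;
    # the paper's mining uses a stronger, bidirectional margin criterion.
    margin = sims[:, 0] / sims.mean(axis=1)
    return [(src_sents[i], trg_sents[int(nbrs[i, 0])])
            for i in range(len(src_sents)) if margin[i] >= threshold]

Each retained pair is then written as src ||| trg, one pair per line.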

Step 2: Word Alignment with SimAlign

Note: we use CRISS as the backbone of SimAlign, with our own implementation. You can also use other aligners; just make sure that the results are stored in a JSON file like the following:

{"inter": [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]], "itermax": [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]]}
{"inter": [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]], "itermax": [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]]}

where "inter" and "itermax" denote the argmax and itermax algorithm in SimAlign respectively. The output is in the same format as the json output of SimAlign. See the code of SimAlign for more details.

Step 3: Training and Testing Lexicon Inducer

Fully Unsupervised

python src/fully_unsup.py \
    -b ./data/bitext.txt \
    -a ./data/bitext.txt.align \
    -te ./data/test.dict 

Weakly Supervised

python src/weakly_sup.py \
    -b ./data/bitext.txt \
    -a ./data/bitext.txt.align \
    -tr ./data/train.dict \
    -te ./data/test.dict \
    -src de_DE \
    -trg en_XX
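
The dictionaries passed via -tr and -te are plain text. Assuming the MUSE dictionary convention (this README does not spell out the format), each line holds one whitespace-separated source-target word pair:

Katze cat
Hund dog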

You may also want to specify a model folder with -o $MODEL_FOLDER to save the statistics of the bitext and alignment (default: ./model).

-src and -trg specify the source and target languages; for the languages and corresponding codes that CRISS supports, check the language pairs in this file.

You will find the final model (model.pt, the lexicon inducer) and the induced lexicon (induced.weaklysup.dict or induced.fullyunsup.dict) in the model folder, as well as a line of evaluation results (on the test set) like the following:

{'oov_number': 0, 'oov_rate': 0.0, 'precision': 1.0, 'recall': 1.0, 'f1': 1.0}
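
As a rough illustration of how such pair-level precision/recall/F1 can be computed (the repo's actual evaluation lives in src/; this sketch is not its exact code):

def evaluate(induced, gold):
    # induced, gold: sets of (source_word, target_word) pairs
    tp = len(induced & gold)
    p = tp / len(induced) if induced else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if tp else 0.0
    return {"precision": p, "recall": r, "f1": f1}

print(evaluate({("Katze", "cat"), ("Hund", "dog")},
               {("Katze", "cat"), ("Hund", "dog")}))
# {'precision': 1.0, 'recall': 1.0, 'f1': 1.0}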

A Quick Example for the MLP-Based Aligner

Training

Train an MLP-based aligner using the bitext and alignment shown above:

python align/train.py \
    -b ./data/bitext.txt \
    -a ./data/bitext.txt.align \
    -src de_DE \
    -trg en_XX \
    -o model/
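
For intuition, an MLP-based word-pair aligner of this kind can be sketched as follows (illustrative only; the dimensions and layers here are assumptions, not the repo's exact architecture):

import torch
import torch.nn as nn

class MLPAligner(nn.Module):
    def __init__(self, dim=1024, hidden=256):  # CRISS (mBART-based) hidden size is 1024
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, src_vec, trg_vec):
        # Probability that a (source, target) word pair is aligned, predicted
        # from the concatenation of their contextual feature vectors.
        return torch.sigmoid(self.net(torch.cat([src_vec, trg_vec], dim=-1)))

model = MLPAligner()
prob = model(torch.randn(1, 1024), torch.randn(1, 1024))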

Testing

Test the saved aligner on the same set (note: this only demonstrates how the code works; in real scenarios, the test set should differ from the training set).

The -b and -a arguments should be the same as those used for training, to avoid potential errors (in fact, if you did not delete anything after training, the -b and -a parameters are never actually used).

python align/test.py \
    -b ./data/bitext.txt \
    -a ./data/bitext.txt.align \
    -src de_DE \
    -trg en_XX \
    -m model/

For the CRISS-SimAlign baseline, you can run a quick evaluation of CRISS-based SimAlign on the above examples for German-English alignment, using the argmax inference algorithm:

python align/eval_simalign_criss.py

License

MIT
