
idiap/HAN_NMT

Description

Implementation of the paper "Document-Level Neural Machine Translation with Hierarchical Attention Networks". It is based on OpenNMT-py (v.2.1): https://github.com/OpenNMT/OpenNMT-py

This is a restricted version: it does NOT support data shards or multimodal translation.

Preprocess

The data, as for any NMT baseline, consist of a source file and a target file aligned at the sentence level. However, the sentences must be kept in order within each document (i.e. not shuffled). Additionally, the model requires a file (doc_file) indicating where each document begins in the source file: each line of the doc_file gives the (0-indexed) line number in the source file at which a new document starts.

Example:

0
10
25

There are 3 documents: the first spans lines 0 to 9, the second lines 10 to 24, and the third line 25 to the end of the file.
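
For illustration, here is a minimal sketch of how such a doc_file could be generated when each document is stored in its own source file; all file names (doc1.src, train.src, train.doc) are hypothetical examples:

# Sketch: concatenate per-document source files and record, for each
# document, the 0-indexed line at which it starts (hypothetical file names).
doc_paths = ["doc1.src", "doc2.src", "doc3.src"]   # one file per document, in order

line_count = 0
with open("train.src", "w", encoding="utf-8") as src_out, \
     open("train.doc", "w", encoding="utf-8") as doc_out:
    for path in doc_paths:
        doc_out.write(f"{line_count}\n")            # this document starts here
        with open(path, encoding="utf-8") as f:
            for sentence in f:
                src_out.write(sentence)
                line_count += 1

The target file is concatenated in the same order; since source and target are sentence-aligned, the same doc_file describes both sides.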

Command:

python preprocess.py -train_src [source_file] -train_tgt [target_file] -train_doc [doc_file] \
-valid_src [source_dev_file] -valid_tgt [target_dev_file] -valid_doc [doc_dev_file] -save_data [out_file]

The folder preprocess_TED_zh-en contains the files to preprocess the TED Talks zh-en dataset from https://wit3.fbk.eu/mt.php?release=2015-01.
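
Continuing the sketch above, the preprocessing step could also be invoked programmatically as follows (all paths, including the ted_zh-en output prefix, are hypothetical; the validation files are assumed to have been prepared the same way):

import subprocess

# Sketch: run preprocess.py on the concatenated files built above
# (hypothetical paths).
subprocess.run([
    "python", "preprocess.py",
    "-train_src", "train.src", "-train_tgt", "train.tgt", "-train_doc", "train.doc",
    "-valid_src", "valid.src", "-valid_tgt", "valid.tgt", "-valid_doc", "valid.doc",
    "-save_data", "ted_zh-en",
], check=True)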

Training

Training the sentence-level NMT baseline:

python train.py -data [data_set] -save_model [sentence_level_model] -encoder_type transformer -decoder_type transformer -enc_layers 6 -dec_layers 6 -label_smoothing 0.1 -src_word_vec_size 512 -tgt_word_vec_size 512 -rnn_size 512 -position_encoding -dropout 0.1 -batch_size 4096 -start_decay_at 20 -report_every 500 -epochs 20 -gpuid 0 -max_generator_batches 16 -batch_type tokens -normalization tokens -accum_count 4 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot \
-train_part sentences

Training HAN-encoder using the sentence-level NMT model:

python train.py -data [data_set] -save_model [HAN_enc_model] -encoder_type transformer -decoder_type transformer -enc_layers 6 -dec_layers 6 -label_smoothing 0.1 -src_word_vec_size 512 -tgt_word_vec_size 512 -rnn_size 512 -position_encoding -dropout 0.1 -batch_size 1024 -start_decay_at 2 -report_every 500 -epochs 1 -gpuid 0 -max_generator_batches 32 -batch_type tokens -normalization tokens -accum_count 4 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot \
-train_part all -context_type HAN_enc -context_size 3 -train_from [sentence_level_model]

Training HAN-decoder using the sentence-level NMT model:

python train.py -data [data_set] -save_model [HAN_dec_model] -encoder_type transformer -decoder_type transformer -enc_layers 6 -dec_layers 6 -label_smoothing 0.1 -src_word_vec_size 512 -tgt_word_vec_size 512 -rnn_size 512 -position_encoding -dropout 0.1 -batch_size 1024 -start_decay_at 2 -report_every 500 -epochs 1 -gpuid 0 -max_generator_batches 32 -batch_type tokens -normalization tokens -accum_count 4 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot \
-train_part all -context_type HAN_dec -context_size 3 -train_from [sentence_level_model]

Training HAN-joint using the HAN-encoder model:

python train.py -data [data_set] -save_model [HAN_joint_model] -encoder_type transformer -decoder_type transformer -enc_layers 6 -dec_layers 6 -label_smoothing 0.1 -src_word_vec_size 512 -tgt_word_vec_size 512 -rnn_size 512 -position_encoding -dropout 0.1 -batch_size 1024 -start_decay_at 2 -report_every 500 -epochs 1 -gpuid 0 -max_generator_batches 32 -batch_type tokens -normalization tokens -accum_count 4 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot \
-train_part all -context_type HAN_join -context_size 3 -train_from [HAN_enc_model]

Input options:

  • train_part: [sentences, context, all]
  • context_type: [HAN_enc, HAN_dec, HAN_join, HAN_dec_source, HAN_dec_context]
  • context_size: number of previous sentences used as context

NOTE: The Transformer model is sensitive to hyperparameter variations. The HAN is also sensitive to the batch size.
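
To summarize the staged fine-tuning above, the sketch below runs the stages in order via subprocess: sentence-level baseline, then HAN-encoder, then HAN-joint (the HAN-decoder variant is trained analogously from the sentence-level model). All paths are hypothetical, the checkpoint names passed to -train_from stand in for the actual checkpoint files written by the previous stage, and a few minor options from the full commands above are omitted for brevity:

import subprocess

# Common Transformer hyperparameters, taken from the commands above
# ("ted_zh-en" is the hypothetical output prefix of preprocess.py).
common = [
    "-data", "ted_zh-en",
    "-encoder_type", "transformer", "-decoder_type", "transformer",
    "-enc_layers", "6", "-dec_layers", "6", "-label_smoothing", "0.1",
    "-src_word_vec_size", "512", "-tgt_word_vec_size", "512", "-rnn_size", "512",
    "-position_encoding", "-dropout", "0.1",
    "-batch_type", "tokens", "-normalization", "tokens", "-accum_count", "4",
    "-optim", "adam", "-adam_beta2", "0.998", "-decay_method", "noam",
    "-warmup_steps", "8000", "-learning_rate", "2", "-max_grad_norm", "0",
    "-param_init", "0", "-param_init_glorot",
    "-gpuid", "0", "-report_every", "500",
]

# 1) Sentence-level baseline.
subprocess.run(["python", "train.py", *common,
                "-save_model", "sent_model", "-train_part", "sentences",
                "-batch_size", "4096", "-epochs", "20", "-start_decay_at", "20"],
               check=True)

# 2) HAN-encoder, fine-tuned from a checkpoint of the sentence-level model.
subprocess.run(["python", "train.py", *common,
                "-save_model", "han_enc_model", "-train_part", "all",
                "-context_type", "HAN_enc", "-context_size", "3",
                "-train_from", "sent_model_checkpoint.pt",    # illustrative name
                "-batch_size", "1024", "-epochs", "1", "-start_decay_at", "2"],
               check=True)

# 3) HAN-joint, fine-tuned from a checkpoint of the HAN-encoder model.
subprocess.run(["python", "train.py", *common,
                "-save_model", "han_joint_model", "-train_part", "all",
                "-context_type", "HAN_join", "-context_size", "3",
                "-train_from", "han_enc_model_checkpoint.pt", # illustrative name
                "-batch_size", "1024", "-epochs", "1", "-start_decay_at", "2"],
               check=True)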

Translation

Translation is done sentence by sentence, even though this is not necessary for HAN_enc or the baseline (this could be improved).

python translate.py -model [model] -src [test_source_file] -doc [test_doc_file] \
-output [out_file] -translate_part all -batch_size 1000 -gpu 0

Input options:

  • translate_part: [sentences, all]
  • batch_size: maximum number of sentences to keep in memory at once.
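
Because the output file contains one translated sentence per line in the same order as the test source, the test doc_file can be used to regroup the translations into documents. A minimal sketch (the file names test.doc, out.txt, and out.doc*.txt are hypothetical):

# Sketch: split the sentence-level output back into documents using the
# test doc_file (hypothetical file names).
with open("test.doc", encoding="utf-8") as f:
    starts = [int(line) for line in f if line.strip()]

with open("out.txt", encoding="utf-8") as f:
    sentences = f.readlines()

# Document i spans lines [starts[i], starts[i+1]); the last runs to the end.
bounds = starts + [len(sentences)]
for i in range(len(starts)):
    with open(f"out.doc{i}.txt", "w", encoding="utf-8") as out:
        out.writelines(sentences[bounds[i]:bounds[i + 1]])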

Test files reported in the paper

The output files of the reported systems: Transformer NMT, cache NMT, HAN-decoder NMT, HAN-encoder NMT, and HAN-encoder-decoder NMT.

  • sub_es-en: OpenSubtitles
  • sub_zh-en: TV subtitles
  • TED_es-en: TED Talks WIT 2015
  • TED_zh-en: TED Talks WIT 2014

Reference:

Miculicich, L., Ram, D., Pappas, N. & Henderson, J. Document-Level Neural Machine Translation with Hierarchical Attention Networks. EMNLP 2018. https://www.aclweb.org/anthology/D18-1325/

Contact:

lmiculicich@idiap.ch
