LearningFromDraft

Code associated to the paper https://arxiv.org/abs/1710.01789

Set-up

The system requirement is specified in CSLT NMT Toolkit. Please follow the instruction to set up the environment.

Training and testing

Introduction

The training, as specified in the paper, conducts in a two-step manner:
1. The first step is to train a vanilla NMT model (folder part 1) and generate a draft;
2. The second step is to train modified NMT model (folder part 2) where the input is composed of source language sentences and the draft generated from step 1.
Accordingly, testing also follows this two-step fashion, i.e., first generate a draft using the model in step 1; then generate the final output using the model in step 2 with source language and draft as input.

Step-by-step

Clone the repository to your working directory.

git clone https://github.com/aodongli/LearningFromDraft.git

Train

We provide a Chinese-English dataset in './data', with 44,000+ sentences in training set, 1,000+ sentences in development/testing set. The training data corresponds to the small-scale IWSLT dataset mentioned in the paper, but the test data is only part of it.

Train a vanilla NMT model using default parameter setting. For detailed description of parameter setting, please refer to this manual.

cd part1
python translate.py

Suppose you have successfully trained a model and saved it at 10000th checkpoint (What is a checkpoint?). Then you need to use it to generate the translation draft for step 2.

python translate.py --model translate.ckpt-10000 --decode --beam_size 5 < ../data/train.src > ../data/train.src_2

Note that the name of the input file and output file is a MUST.

As described in the paper, we use constant word embedding for step 2. Here we extract the embedding from the model we just trained.

python translate.py --var_extract --model translate.ckpt-10000

If successful, emb.src and emb.trg will appear in the main directory.

Congratulations! You have done the preparation work for step 2. Do the similar fashion to train part 2.

cd part2
python translate.py

Test

Similarly, generate a translation draft of your test sentence.

cd part1
python translate.py --model translate.ckpt-10000 --decode --beam_size 5 < ../data/dev.src > ../data/dev.src_2

The input format for part2 decode is

src_sentence_1
draft_1
src_sentence_2
draft_2
...

To generate such a format file, use command paste in Linux/Mac. For example,

paste -d \\n ./data/dev.src ./data/dev.src_2 > ./data/mix_dev.src

Test the result.

cd part2
python ./translate.py --model translate.ckpt-10000 --decode --beam_size 5 < ../data/mix_dev.src > ../data/mix_dev_trans
cd ..
perl ./multi-bleu.perl ./data/mix_dev_trans < ./data/dev.trg

Contact

If you have questions, suggestions and bug reports, please email stamdlee@outlook.com.

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
data		data
part1		part1
part2		part2
train_1		train_1
train_2		train_2
README.md		README.md
multi-bleu.perl		multi-bleu.perl

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data

data

part1

part1

part2

part2

train_1

train_1

train_2

train_2

README.md

README.md

multi-bleu.perl

multi-bleu.perl

Repository files navigation

LearningFromDraft

Code associated to the paper https://arxiv.org/abs/1710.01789

Set-up

Training and testing

Introduction

Step-by-step

Train

Test

Contact

About

Releases

Packages

Languages

aodongli/LearningFromDraft

Folders and files

Latest commit

History

Repository files navigation

LearningFromDraft

Code associated to the paper https://arxiv.org/abs/1710.01789

Set-up

Training and testing

Introduction

Step-by-step

Train

Test

Contact

About

Resources

Stars

Watchers

Forks

Languages