# Person Names Translation

## Training

Follow the preprocessing steps of Speechformer to preprocess the MuST-C data. The models trained on multilingual source audio described in the paper were created with the scripts below.
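All the scripts rely on a few shell variables pointing to data and output locations. The variable names come from the commands themselves; the values here are only a hypothetical example of a setup after preprocessing:

```bash
# Hypothetical example values: adjust paths, subsets, and file names to your setup.
data_root=/path/to/mustc_preprocessed        # root of the preprocessed MuST-C data
train_sets=train                             # comma-separated training subset name(s)
dev_set=dev                                  # validation subset name
config_st=config_st.yaml                     # fairseq speech-to-text YAML config
st_save_dir=/path/to/st_checkpoints/         # keep the trailing slash: logs go to ${st_save_dir}train.log
asr_pretrained=/path/to/asr_checkpoint.pt    # pre-trained ASR checkpoint used to initialize the encoder
```

With these set, the training commands are: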

- **Base**
```bash
# To be run on 4 GPUs (K80).
# --update-freq 16 accumulates gradients over 16 steps to emulate a larger
# effective batch size; the encoder is initialized from a pre-trained ASR model.
fairseq-train $data_root \
	--train-subset $train_sets --valid-subset $dev_set \
	--save-dir $st_save_dir \
	--num-workers 4 --max-update 100000 --patience 5 \
	--max-tokens 5000 \
	--user-dir examples/speech_to_text \
	--task speech_to_text --config-yaml $config_st \
	--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
	--arch s2t_transformer_m_fbk \
	--distance-penalty log \
	--optimizer adam \
	--lr 2e-3 --lr-scheduler inverse_sqrt \
	--warmup-updates 10000 \
	--clip-norm 10.0 \
	--seed 9 --update-freq 16 --load-pretrained-encoder-from $asr_pretrained \
	--skip-invalid-size-inputs-valid-test > ${st_save_dir}train.log 2> ${st_save_dir}train.err
```
- **Triangle**
```bash
# To be run on 4 GPUs (K80).
# The triangle architecture has two decoders (ASR and ST), trained jointly
# with the dual-decoder cross-entropy criterion.
fairseq-train $data_root \
	--train-subset $train_sets --valid-subset $dev_set \
	--save-dir $st_save_dir \
	--num-workers 2 --max-update 100000 --patience 5 \
	--max-tokens 5000 \
	--user-dir examples/speech_to_text \
	--task speech_to_text_ctc --config-yaml $config_st \
	--criterion cross_entropy_dualdecoder --label-smoothing 0.1 \
	--arch s2t_transformer_triangle_m \
	--distance-penalty log \
	--optimizer adam --find-unused-parameters \
	--lr 2e-3 --lr-scheduler inverse_sqrt \
	--warmup-updates 10000 \
	--clip-norm 10.0 \
	--seed 9 --update-freq 16 --load-pretrained-encoder-from $asr_pretrained \
	--skip-invalid-size-inputs-valid-test > ${st_save_dir}train.log 2> ${st_save_dir}train.err
```
- **Triangle λasr = 0.8, λst = 0.2**
```bash
# To be run on 4 GPUs (K80).
# Same as the Triangle configuration, but with re-weighted ASR/ST losses.
fairseq-train $data_root \
	--train-subset $train_sets --valid-subset $dev_set \
	--save-dir $st_save_dir \
	--num-workers 4 --max-update 100000 --patience 5 \
	--max-tokens 5000 \
	--user-dir examples/speech_to_text \
	--task speech_to_text_ctc --config-yaml $config_st \
	--criterion cross_entropy_dualdecoder --label-smoothing 0.1 \
	--arch s2t_transformer_triangle_m \
	--distance-penalty log \
	--optimizer adam --find-unused-parameters \
	--lr 2e-3 --lr-scheduler inverse_sqrt \
	--warmup-updates 10000 \
	--clip-norm 10.0 \
	--auxiliary-loss-weight 0.8 --primary-loss-weight 0.2 \
	--seed 9 --update-freq 16 --load-pretrained-encoder-from $asr_pretrained \
	--skip-invalid-size-inputs-valid-test > ${st_save_dir}train.log 2> ${st_save_dir}train.err
```
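In the last configuration, λasr and λst correspond to `--auxiliary-loss-weight` and `--primary-loss-weight`, i.e., the overall objective is 0.8 · L_asr + 0.2 · L_st, which shifts the training emphasis toward the transcription task.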

## Inference

The output of the triangle models can be obtained using this script:

```bash
python examples/speech_to_text/generate_dualdecoder.py $data_bin \
    --user-dir examples/speech_to_text \
    --config-yaml $conf_yaml --gen-subset $split \
    --max-tokens 10000 --model-overrides "{'load_pretrained_encoder_from':None}" \
    --beam 10 \
    --path $model_path \
    --max-source-positions 10000 --max-target-positions 1000 \
    --task speech_translation_dualdecoding > $out_path
```
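Here `--model-overrides` clears the `load_pretrained_encoder_from` path stored in the checkpoint, so that loading the model for generation does not attempt to re-read the ASR checkpoint used at training time.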

while the output of the base model can be obtained by running:

```bash
python fairseq_cli/generate.py $data_bin \
    --user-dir examples/speech_to_text \
    --config-yaml $conf_yaml --gen-subset $split \
    --max-tokens 10000 \
    --beam 5 \
    --path $model_path \
    --max-source-positions 10000 --max-target-positions 1000 \
    --task speech_to_text > $out_path
```
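Both commands write fairseq's generation log to `$out_path`. Assuming the standard fairseq output format, where detokenized hypotheses appear on `D-` lines, they can be extracted for scoring with something like:

```bash
# Keep only the detokenized hypotheses, sorted by sentence id.
grep ^D- $out_path | sed 's/^D-//' | sort -n | cut -f3 > ${out_path}.hyp
```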

## Models

Here we release the pre-trained models used in our experiments. They correspond to those reported in Table 7 of the paper. For each language pair, we release the SentencePiece dictionaries, the fairseq configuration files, and the checkpoints.

| dictionaries | config | base | triangle | triangle_08 |
|:---:|:---:|:---:|:---:|:---:|
| all-es | all-es | all-es | all-es | all-es |
| all-fr | all-fr | all-fr | all-fr | all-fr |
| all-it | all-it | all-it | all-it | all-it |

## Citation

```bibtex
@inproceedings{gaido-etal-2022-who,
    title = {{Who Are We Talking About? Handling Person Names in Speech Translation}},
    author = "Gaido, Marco and Negri, Matteo and Turchi, Marco",
    booktitle = "Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)",
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics"
}
```