Follow the preprocessing steps of Speechformer to preprocess the MuST-C data. The models trained on multilingual source audio described in the paper were created with the following scripts:
- Base
```sh
# To be run on 4 GPUs (K80)
fairseq-train $data_root \
--train-subset $train_sets --valid-subset $dev_set \
--save-dir $st_save_dir \
--num-workers 4 --max-update 100000 --patience 5 \
--max-tokens 5000 \
--user-dir examples/speech_to_text \
--task speech_to_text --config-yaml $config_st \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--arch s2t_transformer_m_fbk \
--distance-penalty log \
--optimizer adam \
--lr 2e-3 --lr-scheduler inverse_sqrt \
--warmup-updates 10000 \
--clip-norm 10.0 \
--seed 9 --update-freq 16 --load-pretrained-encoder-from $asr_pretrained \
--skip-invalid-size-inputs-valid-test > ${st_save_dir}train.log 2> ${st_save_dir}train.err
```
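Note that `--update-freq 16` accumulates gradients over 16 forward/backward passes per GPU before each optimizer step. As a rough sanity check (an assumed rule of thumb, not stated in the paper), the effective batch budget per update is approximately `max_tokens × num_gpus × update_freq`:

```sh
# Rough effective batch budget per optimizer update, in the units counted
# by --max-tokens. Values taken from the command above.
max_tokens=5000
num_gpus=4
update_freq=16
echo "effective --max-tokens budget per update: $(( max_tokens * num_gpus * update_freq ))"
# effective --max-tokens budget per update: 320000
```

If you train on fewer GPUs, increasing `--update-freq` proportionally keeps this budget roughly constant.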
- Triangle
```sh
# To be run on 4 GPUs (K80)
fairseq-train $data_root \
--train-subset $train_sets --valid-subset $dev_set \
--save-dir $st_save_dir \
--num-workers 2 --max-update 100000 --patience 5 \
--max-tokens 5000 \
--user-dir examples/speech_to_text \
--task speech_to_text_ctc --config-yaml $config_st \
--criterion cross_entropy_dualdecoder --label-smoothing 0.1 \
--arch s2t_transformer_triangle_m \
--distance-penalty log \
--optimizer adam --find-unused-parameters \
--lr 2e-3 --lr-scheduler inverse_sqrt \
--warmup-updates 10000 \
--clip-norm 10.0 \
--seed 9 --update-freq 16 --load-pretrained-encoder-from $asr_pretrained \
--skip-invalid-size-inputs-valid-test > ${st_save_dir}train.log 2> ${st_save_dir}train.err
```
- Triangle λasr = 0.8, λst = 0.2
```sh
# To be run on 4 GPUs (K80)
fairseq-train $data_root \
--train-subset $train_sets --valid-subset $dev_set \
--save-dir $st_save_dir \
--num-workers 4 --max-update 100000 --patience 5 \
--max-tokens 5000 \
--user-dir examples/speech_to_text \
--task speech_to_text_ctc --config-yaml $config_st \
--criterion cross_entropy_dualdecoder --label-smoothing 0.1 \
--arch s2t_transformer_triangle_m \
--distance-penalty log \
--optimizer adam --find-unused-parameters \
--lr 2e-3 --lr-scheduler inverse_sqrt \
--warmup-updates 10000 \
--clip-norm 10.0 \
--auxiliary-loss-weight 0.8 --primary-loss-weight 0.2 \
--seed 9 --update-freq 16 --load-pretrained-encoder-from $asr_pretrained \
--skip-invalid-size-inputs-valid-test > ${st_save_dir}train.log 2> ${st_save_dir}train.err
```
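Assuming `--auxiliary-loss-weight` corresponds to λasr (the weight of the ASR decoder loss) and `--primary-loss-weight` to λst (the weight of the ST decoder loss), this configuration presumably optimizes the weighted combination

$$\mathcal{L} = \lambda_{asr}\,\mathcal{L}_{asr} + \lambda_{st}\,\mathcal{L}_{st}, \qquad \lambda_{asr}=0.8,\ \lambda_{st}=0.2$$

i.e. it emphasizes the ASR (transcription) objective over the translation one during training.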
The output of the triangle models can be obtained using this script:
```sh
python examples/speech_to_text/generate_dualdecoder.py $data_bin \
--user-dir examples/speech_to_text \
--config-yaml $conf_yaml --gen-subset $split \
--max-tokens 10000 --model-overrides "{'load_pretrained_encoder_from':None}" \
--beam 10 \
--path $model_path \
--max-source-positions 10000 --max-target-positions 1000 \
--task speech_translation_dualdecoding > $out_path
```
while for the base model it can be obtained by running:

```sh
python fairseq_cli/generate.py $data_bin \
--user-dir examples/speech_to_text \
--config-yaml $conf_yaml --gen-subset $split \
--max-tokens 10000 \
--beam 5 \
--path $model_path \
--max-source-positions 10000 --max-target-positions 1000 \
--task speech_to_text > $out_path
```
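Both commands write standard fairseq-generate output, which interleaves source (`S-`), reference (`T-`), and detokenized hypothesis (`D-`) lines. A small post-processing sketch to recover the hypotheses in sample order (the file content below is a toy stand-in for `$out_path`, not real model output):

```sh
# Simulate a tiny fairseq-generate output file, then extract the detokenized
# hypotheses (D- lines) sorted by sample id.
out_path=demo_generate.out
printf 'S-1\tsource one\nT-1\treference one\nD-1\t-0.10\thyp one\nS-0\tsource zero\nD-0\t-0.20\thyp zero\n' > "$out_path"
grep "^D-" "$out_path" | sort -V | cut -f3 > hypotheses.txt
cat hypotheses.txt
# hyp zero
# hyp one
```

The extracted `hypotheses.txt` can then be scored against the references, e.g. with SacreBLEU.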
Here we release the pre-trained models used in our experiments; they correspond to those reported in Table 7 of the paper. For each language pair, we release the SentencePiece dictionaries, the Fairseq configuration files, and the model checkpoints.
dictionaries | config | base | triangle | triangle_08 |
---|---|---|---|---|
all-es | all-es | all-es | all-es | all-es |
all-fr | all-fr | all-fr | all-fr | all-fr |
all-it | all-it | all-it | all-it | all-it |
Please cite the paper as:

```bibtex
@inproceedings{gaido-etal-2022-who,
    title = {{Who Are We Talking About? Handling Person Names in Speech Translation}},
    author = "Gaido, Marco and Negri, Matteo and Turchi, Marco",
    booktitle = "Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)",
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics"
}
```