Multimodal Subtitle Segmenter (AACL 2022)

Code for the paper: Dodging the Data Bottleneck: Automatic Subtitling with Automatically Segmented ST Corpora.

Pre-trained multimodal models

Here we release the multilingual multimodal models with parallel attention (Figure 1 in the paper). The original model used in the paper is de_en_fr_it. We also release another version of the segmenter (all_langs), trained on all MuST-Cinema languages, which achieves overall higher scores for the languages not covered by the de_en_fr_it model.

| Results (Sigma-CPL%) | de | en | es | fr | it | nl | pt | ro |
|---|---|---|---|---|---|---|---|---|
| de_en_fr_it | 86.4-89.1 | 88.2-94.6 | 81.2-89.3 | 86.7-93.3 | 85.5-89.3 | 81.2-89.0 | 81.4-89.3 | 75.3-83.3 |
| all_langs | 86.3-88.4 | 87.1-94.4 | 84.5-93.0 | 87.1-93.2 | 85.2-89.6 | 86.5-86.0 | 87.2-89.6 | 86.6-91.4 |

Preprocessing

Preprocess the MuST-Cinema dataset as already explained here. Then, run the following code:

for subset in train dev amara; do
        # Extract the segmented text (column 5, with <eob>/<eol> markers) from the TSV
        cut -f 5 ${DATA_ROOT}/en-${LANG}/${subset}_st_src.tsv > \
         ${DATA_ROOT}/en-${LANG}/${subset}.${LANG}.multimod
        # Strip the <eob>/<eol> markers and normalize spaces to obtain the unsegmented text
        sed 's/<eob>//g; s/<eol>//g; s/  / /g; s/^ //g; s/ $//g' \
         ${DATA_ROOT}/en-${LANG}/${subset}.${LANG}.multimod > \
         ${DATA_ROOT}/en-${LANG}/${subset}.${LANG}.multimod.unsegm
        # Append the unsegmented text as an additional column and keep the relevant columns
        paste ${DATA_ROOT}/en-${LANG}/${subset}_st_src.tsv \
         ${DATA_ROOT}/en-${LANG}/${subset}.${LANG}.multimod.unsegm \
         | cut -f 1,2,3,5,6,7 > ${DATA_ROOT}/en-${LANG}/${subset}_multi_segm.tsv
        # Rename the header of the appended column from tgt_text to src_text
        sed -i '1s/tgt_text$/src_text/g' ${DATA_ROOT}/en-${LANG}/${subset}_multi_segm.tsv
done

where DATA_ROOT is the folder containing the preprocessed data and LANG is the language (en, de, fr, and it for the train, dev, and amara sets; es and nl only for the amara set). Lastly, for each subset and each language, add the target language as a TSV column to enable Fairseq-ST multilingual training/inference:

awk 'NR==1 {printf("%s\t%s\n", $0, "tgt_lang")}  NR>1 {printf("%s\t%s\n", $0, "'"${LANG}"'")}' \
  ${DATA_ROOT}/en-${LANG}/${subset}_multi_segm.tsv > ${DATA_ROOT}/${subset}_${LANG}_multi.tsv
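
For convenience, this column can be added for every language/subset combination with a wrapper loop like the sketch below. It assumes that the _multi_segm.tsv files above have already been created for all combinations and follows the subset availability described above (es and nl have only the amara set):

for LANG in en de fr it es nl; do
        # es and nl are only available for the amara set
        if [ "${LANG}" = "es" ] || [ "${LANG}" = "nl" ]; then
                subsets="amara"
        else
                subsets="train dev amara"
        fi
        for subset in ${subsets}; do
                awk 'NR==1 {printf("%s\t%s\n", $0, "tgt_lang")}  NR>1 {printf("%s\t%s\n", $0, "'"${LANG}"'")}' \
                  ${DATA_ROOT}/en-${LANG}/${subset}_multi_segm.tsv > ${DATA_ROOT}/${subset}_${LANG}_multi.tsv
        done
done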

To generate a single SentencePiece model and, consequently, a shared vocabulary for all the training languages (as done in our paper), run the script below:

python ${FBK_fairseq}/examples/speech_to_text/scripts/gen_multilang_spm_vocab.py \
  --data-root ${DATA_ROOT} --save-dir ${DATA_ROOT} \
  --langs en,de,fr,it --splits train_en_multi,train_de_multi,train_fr_multi,train_it_multi \
  --vocab-type unigram --vocab-size 10000
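
The generated SentencePiece model and vocabulary are then referenced by the CONFIG_YAML used for training and generation below. As a rough sketch, assuming the standard fairseq speech-to-text data configuration fields and hypothetical file names for the generated model and vocabulary (check the actual names produced by the script above; the multimodal task of this repository may require additional fields):

# Hypothetical example: adapt the file names to the output of gen_multilang_spm_vocab.py
cat > ${DATA_ROOT}/config_multimodal.yaml <<EOF
vocab_filename: spm_unigram10000_multi.txt                          # assumed vocabulary file
bpe_tokenizer:
  bpe: sentencepiece
  sentencepiece_model: ${DATA_ROOT}/spm_unigram10000_multi.model    # assumed SPM model file
prepend_tgt_lang_tag: true    # target language tags, consistent with --ignore-prefix-size 1
input_feat_per_channel: 80
input_channels: 1
EOF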

Training

To train the multilingual multimodal model with parallel attention, run the code below:

python ${FBK_fairseq}/train.py ${DATA_ROOT} \
        --train-subset train_de_multi,train_en_multi,train_fr_multi,train_it_multi \
        --valid-subset dev_de_multi,dev_en_multi,dev_fr_multi,dev_it_multi \
        --save-dir ${SAVE_DIR} \
        --num-workers 2 --max-update 200000 \
        --max-tokens 40000 \
        --user-dir examples/speech_to_text \
        --task speech_to_text_multimodal --config-yaml ${CONFIG_YAML}  \
        --criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy \
        --label-smoothing 0.1 \
        --arch s2t_transformer_dual_encoder_s \
        --ctc-encoder-layer 8 --ctc-compress-strategy avg --ctc-weight 0.5 \
        --context-encoder-layers 12 --decoder-layers 3 \
        --context-dropout 0.3 --context-ffn-embed-dim 1024 \
        --share-encoder-decoder-embed \
        --context-decoder-attention-type parallel \
        --optimizer adam --lr 1e-3 --lr-scheduler inverse_sqrt \
        --warmup-updates 10000 \
        --clip-norm 10.0 \
        --seed 1 --update-freq 4 \
        --patience 15 \
        --ignore-prefix-size 1 \
        --skip-invalid-size-inputs-valid-test \
        --log-format simple --find-unused-parameters

where FBK_fairseq is the folder of our repository, DATA_ROOT is the folder containing the preprocessed data, SAVE_DIR is the folder in which to save the checkpoints of the model, and CONFIG_YAML is the path to the config YAML file.
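
For reference, these variables can be exported once before running the commands in this document. The paths below are placeholders and the config file name is the hypothetical one used in the sketch above:

export FBK_fairseq=/path/to/FBK-fairseq                    # clone of this repository
export DATA_ROOT=/path/to/must-cinema-preprocessed         # preprocessed MuST-Cinema data
export SAVE_DIR=/path/to/checkpoints                       # where checkpoints will be written
export CONFIG_YAML=${DATA_ROOT}/config_multimodal.yaml     # hypothetical config name
mkdir -p ${SAVE_DIR}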

This training setup is intended for 2 NVIDIA A40 48GB GPUs. Please adjust --max-tokens and --update-freq so that max_tokens * update_freq * number of GPUs used for training = 320,000. For example, the values above (--max-tokens 40000 and --update-freq 4) assume 2 GPUs (40,000 * 4 * 2 = 320,000); on 4 GPUs, --update-freq 2 keeps the same effective batch size.

Generation

First, average the checkpoints as already explained in our repository here.
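
As a minimal sketch, assuming the standard fairseq scripts/average_checkpoints.py and averaging, for example, the last 5 epoch checkpoints (follow the repository instructions linked above for the exact number used in the paper):

python ${FBK_fairseq}/scripts/average_checkpoints.py \
      --inputs ${SAVE_DIR} --num-epoch-checkpoints 5 \
      --output ${SAVE_DIR}/avg5_checkpoint.pt
# Use the averaged model in the next step, e.g. CHECKPOINT_FILENAME=avg5_checkpoint.pt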

Second, run the code below:

python ${FBK_fairseq}/generate.py ${DATA_ROOT} \
      --config-yaml ${CONFIG_YAML} --gen-subset amara_${LANG}_multi \
      --task speech_to_text_multimodal \
      --criterion ctc_multi_loss --underlying-criterion label_smoothed_cross_entropy \
      --user-dir examples/speech_to_text \
      --path ${SAVE_DIR}/${CHECKPOINT_FILENAME} \
      --max-tokens 25000 --beam 5 --scoring sacrebleu \
      --results-path ${SAVE_DIR}

where LANG is the language selected for inference and CHECKPOINT_FILENAME is the checkpoint file obtained by the averaging step above.

Please use sacrebleu to obtain BLEU scores and EvalSubtitle to obtain Sigma and CPL.
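
With --results-path, fairseq writes the translations to ${SAVE_DIR}/generate-amara_${LANG}_multi.txt. As a sketch of the scoring step, assuming a detokenized reference file amara.${LANG}.ref (not produced by the commands above), the hypotheses can be extracted and scored with the sacrebleu CLI as follows:

# Extract the detokenized hypotheses (D-* lines) in the original sentence order
grep ^D- ${SAVE_DIR}/generate-amara_${LANG}_multi.txt \
      | sed 's/^D-//' | sort -n | cut -f 3 > ${SAVE_DIR}/amara.${LANG}.hyp
# Compute BLEU against the assumed reference file
sacrebleu ${SAVE_DIR}/amara.${LANG}.ref < ${SAVE_DIR}/amara.${LANG}.hyp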

Citation

@inproceedings{papi-etal-2022-dodging,
    title = "Dodging the Data Bottleneck: Automatic Subtitling with Automatically Segmented {ST} Corpora",
    author = "Papi, Sara  and
      Karakanta, Alina  and
      Negri, Matteo  and
      Turchi, Marco",
    booktitle = "Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)",
    month = nov,
    year = "2022",
    address = "Online only",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.aacl-short.59",
    pages = "480--487",
}