Adaptive Feature Selection for End-to-End Speech Translation, EMNLP 2020 Findings

This paper aims to improve end-to-end speech translation by improving the quality of speech features through feature selection. We argue that speech signals are often noisy and lengthy, containing a large amount of redundant signal that contributes little to speech recognition, and hence also to speech translation. Our solution is to discard such transcript-irrelevant features so that speech translation models can access more meaningful speech signals, easing the learning of the speech-target translation correspondence/alignment.

We propose adaptive feature selection (AFS), based on L0Drop, which learns to route information through a subset of speech features to support speech tasks. The learning process is automatic, with a hyperparameter controlling the degree of induced sparsity. The figure below shows the training procedure with AFS:

The example below shows the positions of the selected features used for speech translation:
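To make the gating mechanism concrete, below is a minimal, hypothetical sketch (in PyTorch, not the repository's code) of an L0Drop-style hard-concrete gate applied over the time dimension of encoded speech; the class name `HardConcreteGate` and the default hyperparameter values are illustrative assumptions.

```python
# Illustrative sketch only: an L0Drop-style hard-concrete gate over time steps.
import math
import torch
import torch.nn as nn

class HardConcreteGate(nn.Module):
    """Scores each time step of the encoded speech and produces a gate in [0, 1];
    an L0-style penalty pushes most gates to exactly 0, so only a sparse subset of
    speech features is routed onward to the translation encoder."""

    def __init__(self, d_model, beta=0.5, gamma=-0.1, zeta=1.1):
        super().__init__()
        self.scorer = nn.Linear(d_model, 1)   # per-position gate logit
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self, x):
        # x: (batch, time, d_model) features from a (pretrained) ASR encoder
        log_alpha = self.scorer(x).squeeze(-1)                      # (batch, time)
        if self.training:
            # reparameterised noisy gate (hard-concrete distribution)
            u = torch.rand_like(log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + log_alpha) / self.beta)
        else:
            s = torch.sigmoid(log_alpha / self.beta)
        # stretch to (gamma, zeta) and clip, so many gates hit exactly 0 or 1
        gate = torch.clamp(s * (self.zeta - self.gamma) + self.gamma, 0.0, 1.0)
        # expected L0 penalty: probability that each gate is non-zero
        l0 = torch.sigmoid(log_alpha - self.beta * math.log(-self.gamma / self.zeta)).mean()
        return x * gate.unsqueeze(-1), gate, l0
```

In such a sketch, the L0 term would be added to the translation loss (e.g. `loss = ce + lambda * l0`), with the weight on the penalty playing the role of the sparsity-controlling hyperparameter mentioned above.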

In our experiments, we observe substantial BLEU improvements over an ASR-pretrained ST baseline, while our method filters out ~85% of the speech features (yielding a ~1.4x decoding speedup as a by-product).

In short, our work demonstrates that E2E ST suffers from redundant speech features, and that sparsification brings significant performance improvements. The E2E ST task thus offers new opportunities for follow-up research on sparse models that deliver performance gains, beyond enhancing efficiency and/or interpretability.
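To illustrate where the decoding speedup comes from, here is a hedged usage sketch (with arbitrary shapes and placeholder gates, not the repository's code): at test time most gates are exactly zero, so the corresponding frames can be dropped before the ST encoder processes them.

```python
# Hypothetical illustration of pruning zero-gated frames before the ST encoder.
import torch

encoder_out = torch.randn(1, 200, 512)            # (batch, time, d_model) from the ASR encoder
gate = (torch.rand(1, 200) > 0.85).float()        # placeholder gates: ~15% of frames survive
kept = encoder_out[0][gate[0] > 0].unsqueeze(0)   # drop the zero-gated frames
print(encoder_out.shape, "->", kept.shape)        # the ST encoder now runs on a much shorter input
```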

Model Training & Evaluation

Please go to the speech_translation branch for more details, where we provide an example of training and evaluation.

Performance and Download

We provide pretrained models for MuST-C En-De and LibriSpeech En-Fr. We also provide our models' translations for each test set.

Results on MuST-C

  • BLEU scores and sparsity rates on the MuST-C corpus. Our model outperforms the baselines substantially.
| Metric | Model | De | Es | Fr | It | Nl | Pt | Ro | Ru |
|---|---|---|---|---|---|---|---|---|---|
| BLEU | ST | 17.44 | 23.85 | 28.43 | 19.54 | 21.23 | 22.55 | 17.66 | 12.10 |
| | ST+ASR-PT | 20.67 | 25.96 | 32.24 | 20.84 | 23.27 | 24.83 | 19.94 | 13.96 |
| | ST+AFS-t | 21.57 | 26.78 | 33.34 | 23.08 | 24.68 | 26.13 | 21.73 | 15.10 |
| | ST+AFS-tf | 22.38 | 27.04 | 33.43 | 23.35 | 25.05 | 26.55 | 21.87 | 14.92 |
| Sparsity Rate | ST+AFS-t | 84.4% | 84.5% | 83.2% | 84.9% | 84.4% | 84.4% | 84.7% | 84.2% |
| | ST+AFS-tf | 85.1% | 84.5% | 84.7% | 84.9% | 83.5% | 85.1% | 84.8% | 84.7% |
  • We offer our models' translations to ease direct comparison for follow-up studies.
| Model | De | Es | Fr | It | Nl | Pt | Ro | Ru |
|---|---|---|---|---|---|---|---|---|
| ST | txt | txt | txt | txt | txt | txt | txt | txt |
| ST+ASR-PT | txt | txt | txt | txt | txt | txt | txt | txt |
| ST+AFS-t | txt | txt | txt | txt | txt | txt | txt | txt |
| ST+AFS-tf | txt | txt | txt | txt | txt | txt | txt | txt |
  • Pretrained models on MuST-C En-De:

| Model | MuST-C EnDe |
|---|---|
| ST | model |
| ST+ASR-PT | model |
| ST+AFS-t | model |
| ST+AFS-tf | model |

Results on LibriSpeech En-Fr

Similar to MuST-C, we provide the preprocessed dataset (~16 GB), translation performance, translation outputs, and pretrained models.

| Model | LibriSpeech En-Fr (BLEU) | Translation | Pretrained model |
|---|---|---|---|
| ST | 14.32 | txt | model |
| ST+ASR-PT | 17.05 | txt | model |
| ST+AFS-t | 18.33 | txt | model |
| ST+AFS-tf | 18.56 | txt | model |

Please go to AFS for E2E ST for more details.

Citation

Please consider citing our paper as follows:

Biao Zhang; Ivan Titov; Barry Haddow; Rico Sennrich (2020). Adaptive Feature Selection for End-to-End Speech Translation. In Findings of the Association for Computational Linguistics: EMNLP 2020.

@inproceedings{zhang-etal-2020-adaptive,
    title = "Adaptive Feature Selection for End-to-End Speech Translation",
    author = "Zhang, Biao  and
      Titov, Ivan  and
      Haddow, Barry  and
      Sennrich, Rico",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.findings-emnlp.230",
    doi = "10.18653/v1/2020.findings-emnlp.230",
    pages = "2533--2544",
    abstract = "Information in speech signals is not evenly distributed, making it an additional challenge for end-to-end (E2E) speech translation (ST) to learn to focus on informative features. In this paper, we propose adaptive feature selection (AFS) for encoder-decoder based E2E ST. We first pre-train an ASR encoder and apply AFS to dynamically estimate the importance of each encoded speech feature to ASR. A ST encoder, stacked on top of the ASR encoder, then receives the filtered features from the (frozen) ASR encoder. We take L0DROP (Zhang et al., 2020) as the backbone for AFS, and adapt it to sparsify speech features with respect to both temporal and feature dimensions. Results on LibriSpeech EnFr and MuST-C benchmarks show that AFS facilitates learning of ST by pruning out {\textasciitilde}84{\%} temporal features, yielding an average translation gain of {\textasciitilde}1.3-1.6 BLEU and a decoding speedup of {\textasciitilde}1.4x. In particular, AFS reduces the performance gap compared to the cascade baseline, and outperforms it on LibriSpeech En-Fr with a BLEU score of 18.56 (without data augmentation).",
}