- paper link
- source code is given in the speech_translation branch
This paper aims to improve end-to-end speech translation (ST) by improving the quality of speech features through feature selection. We argue that speech signals are often noisy and lengthy, containing a large amount of redundant signal that contributes little to speech recognition, and thus also little to speech translation. Our solution is to discard those transcript-irrelevant features so that speech translation models can access more meaningful speech signals, easing the learning of speech-target translation correspondence/alignment.
We propose adaptive feature selection (AFS), based on L0Drop, which learns to route information through a subset of speech features to support speech tasks. The learning process is automatic, with a hyperparameter controlling the degree of sparsity induced. The figure below shows the training procedure with AFS:
The example below shows the positions of the selected features used for speech translation:
In our experiments, we observe substantial BLEU improvements compared against an ASR-pretrained ST baseline, while our method filters out ~85% of the speech features (with a ~1.4x decoding speedup as a by-product).
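To make the selection mechanism concrete, below is a minimal NumPy sketch of the hard-concrete gating that underlies L0Drop: each speech frame gets a learned logit, the gate can reach exactly zero, and zero-gated frames are dropped. This is an illustrative toy (function and variable names such as `hard_concrete_gate` and `select_features` are our own, not from the codebase), not the actual implementation in the `speech_translation` branch.

```python
import numpy as np

def hard_concrete_gate(log_alpha, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1, u=None):
    """Hard-concrete gate (the distribution behind L0Drop).

    With a uniform sample u the gate is stochastic (training); without it,
    the gate is deterministic (inference). Stretching by (gamma, zeta) and
    clipping to [0, 1] lets gates be exactly 0 or 1.
    """
    if u is None:  # deterministic gate at inference time
        s = 1.0 / (1.0 + np.exp(-log_alpha))
    else:  # noisy gate at training time
        s = 1.0 / (1.0 + np.exp(-(np.log(u) - np.log(1.0 - u) + log_alpha) / beta))
    return np.clip(s * (zeta - gamma) + gamma, 0.0, 1.0)

def select_features(features, log_alpha):
    """Drop time steps whose gate is exactly zero; rescale the survivors."""
    g = hard_concrete_gate(log_alpha)          # shape (T,)
    keep = g > 0.0                             # boolean mask over frames
    return features[keep] * g[keep, None], keep

# Toy example: 6 speech frames with 4-dim features.
rng = np.random.default_rng(0)
feats = rng.standard_normal((6, 4))
# Learned per-frame logits; strongly negative logits close their gates.
log_alpha = np.array([5.0, -8.0, 4.0, -8.0, -8.0, 3.0])
pruned, keep = select_features(feats, log_alpha)
sparsity = 1.0 - keep.mean()  # fraction of frames pruned away
```

With these logits, half of the frames receive a zero gate and are removed; in the real model an L0-style penalty pushes most logits negative, which is where the ~85% sparsity rates reported below come from.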
In short, our work demonstrates that E2E ST suffers from redundant speech features, with sparsification bringing significant performance improvements. The E2E ST task offers new opportunities for follow-up research in sparse models to deliver performance gains, apart from enhancing efficiency and/or interpretability.
Please go to the speech_translation branch for more details, where we provide an example for training/evaluation.
We provide pretrained models for MuST-C En-De and LibriSpeech En-Fr. We also provide our models' translations for each test set.
- BLEU score and sparsity rate on the MuST-C corpus. Our model outperforms the baselines substantially.
Metric | Model | De | Es | Fr | It | Nl | Pt | Ro | Ru |
---|---|---|---|---|---|---|---|---|---|
BLEU | ST | 17.44 | 23.85 | 28.43 | 19.54 | 21.23 | 22.55 | 17.66 | 12.10 |
| ST+ASR-PT | 20.67 | 25.96 | 32.24 | 20.84 | 23.27 | 24.83 | 19.94 | 13.96 |
| ST+AFS-t | 21.57 | 26.78 | 33.34 | 23.08 | 24.68 | 26.13 | 21.73 | 15.10 |
| ST+AFS-tf | 22.38 | 27.04 | 33.43 | 23.35 | 25.05 | 26.55 | 21.87 | 14.92 |
Sparsity Rate | ST+AFS-t | 84.4% | 84.5% | 83.2% | 84.9% | 84.4% | 84.4% | 84.7% | 84.2% |
| ST+AFS-tf | 85.1% | 84.5% | 84.7% | 84.9% | 83.5% | 85.1% | 84.8% | 84.7% |
- We offer our models' translations to ease direct comparison for follow-up studies.
Model | De | Es | Fr | It | Nl | Pt | Ro | Ru |
---|---|---|---|---|---|---|---|---|
ST | txt | txt | txt | txt | txt | txt | txt | txt |
ST+ASR-PT | txt | txt | txt | txt | txt | txt | txt | txt |
ST+AFS-t | txt | txt | txt | txt | txt | txt | txt | txt |
ST+AFS-tf | txt | txt | txt | txt | txt | txt | txt | txt |
- For MuST-C En-De, we also provide the preprocessed dataset for download (note: it is very large, ~66G). In addition, we provide the trained models below.
Model | MuST-C EnDe |
---|---|
ST | model |
ST+ASR-PT | model |
ST+AFS-t | model |
ST+AFS-tf | model |
For LibriSpeech En-Fr, similar to MuST-C, we provide the preprocessed dataset (~16G), translation performance, translation outputs, and pretrained models.
Model | LibriSpeech EnFr (BLEU) |
---|---|
ST | 14.32 txt model |
ST+ASR-PT | 17.05 txt model |
ST+AFS-t | 18.33 txt model |
ST+AFS-tf | 18.56 txt model |
Please go to AFS for E2E ST for more details.
Please consider citing our paper as follows:
Biao Zhang; Ivan Titov; Barry Haddow; Rico Sennrich (2020). Adaptive Feature Selection for End-to-End Speech Translation. In Findings of the Association for Computational Linguistics: EMNLP 2020.
@inproceedings{zhang-etal-2020-adaptive,
title = "Adaptive Feature Selection for End-to-End Speech Translation",
author = "Zhang, Biao and
Titov, Ivan and
Haddow, Barry and
Sennrich, Rico",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.findings-emnlp.230",
doi = "10.18653/v1/2020.findings-emnlp.230",
pages = "2533--2544",
abstract = "Information in speech signals is not evenly distributed, making it an additional challenge for end-to-end (E2E) speech translation (ST) to learn to focus on informative features. In this paper, we propose adaptive feature selection (AFS) for encoder-decoder based E2E ST. We first pre-train an ASR encoder and apply AFS to dynamically estimate the importance of each encoded speech feature to ASR. A ST encoder, stacked on top of the ASR encoder, then receives the filtered features from the (frozen) ASR encoder. We take L0DROP (Zhang et al., 2020) as the backbone for AFS, and adapt it to sparsify speech features with respect to both temporal and feature dimensions. Results on LibriSpeech EnFr and MuST-C benchmarks show that AFS facilitates learning of ST by pruning out {\textasciitilde}84{\%} temporal features, yielding an average translation gain of {\textasciitilde}1.3-1.6 BLEU and a decoding speedup of {\textasciitilde}1.4x. In particular, AFS reduces the performance gap compared to the cascade baseline, and outperforms it on LibriSpeech En-Fr with a BLEU score of 18.56 (without data augmentation).",
}